In this paper we study the quantum generalisation of the skew divergence, which is
a dissimilarity measure between distributions introduced by L. Lee in the context 
of natural language processing. We also introduce the differential skew divergence, 
which is a closely related concept with the benefit of an additional symmetry. We 
provide an in-depth study of both the quantum skew divergence and differential 
skew divergence, including their relation to other state distinguishability measures. 
Finally, we present a number of important applications: new continuity inequalities 
for the quantum Jensen-Shannon divergence and the Holevo information, and a new 
and short proof of Bravyi's Small Incremental Mixing conjecture. 
1 Introduction



The quantum relative entropy of two density operators $\rho$ and $\sigma$, denoted
$S(\rho\|\sigma) = \mathrm{Tr}\,\rho(\log\rho - \log\sigma)$, was first introduced by Umegaki [32] in 1962.
Since the 90's it gained in popularity, especially in the quantum information 
theory community, when Hiai and Petz p3| showed that Umegaki's formula 
provided the proper quantum generalisation of the classical Kullback-Leibler 
divergence $\mathrm{KL}(p\|q)$ of two probability distributions, as an operational
measure of dissimilarity between quantum states. Much research has since gone
into exploring its mathematical and physical properties. Despite having many
universally useful features, the relative entropy exhibits certain properties that in
some applications may be considered drawbacks. In particular, the relative
entropy is not a distance measure in the mathematical sense of the word: it
is asymmetric with respect to interchanging its arguments, $S(\rho\|\sigma) \neq S(\sigma\|\rho)$,
and it does not satisfy a triangle inequality. Moreover, the relative entropy
$S(\rho\|\sigma)$ is infinite whenever the support of $\rho$ is not contained in the support of $\sigma$.
This makes the relative entropy completely unsuitable as a distance measure
between pure states, for example. We will refer to this feature as the 'infinity
problem'.

Over the years, several modifications to the relative entropy have been pro- 
posed. Some of the better known modifications are the Quantum Jensen- 
Shannon divergence [T2l[T3] . and the closely related Holevo information or 
Holevo $\chi$ [15,24] (even though this is not usually considered a modification
of the relative entropy in the QIT community because it serves entirely
different purposes).

In the present paper we introduce another modification of the quantum relative
entropy, which we call the quantum skew divergence. We have coined this
term (see footnote 1) because of its close similarity to the already existing
classical concept of skew divergence of two probability distributions, which was
introduced by Lee [16,17] in the context of natural language processing to overcome
the infinity problem for the Kullback-Leibler divergence. As no confusion will arise
we will henceforth refer to the quantum skew divergence as skew divergence
(SD) for short. It is not to be confused with the Wigner-Yanase-Dyson skew
information and related notions, to which it bears no obvious resemblance.

The skew divergence is essentially the relative entropy but with a 'skewed'
second argument. That is, the second argument $\sigma$ is replaced by the convex
combination $\alpha\rho + (1-\alpha)\sigma$, where $\alpha$ is a scalar ($0 < \alpha < 1$) which we call the
skewing parameter. As one of its basic properties we will show that
$S(\rho\|\alpha\rho + (1-\alpha)\sigma)$ is no longer infinite but is bounded above by $-\log\alpha$, and we
define the skew divergence as the skewed relative entropy divided by this factor
$-\log\alpha$:

$$\mathrm{SD}_\alpha(\rho\|\sigma) := \frac{1}{-\log\alpha}\, S(\rho\|\alpha\rho + (1-\alpha)\sigma).$$

Hence, SD always takes values between 0 and 1. It is to be noted that Lee's
skew divergence does not have this normalisation factor.

In this paper we provide an in-depth study of the mathematical properties of
the skew divergence, many of which are unexpected and useful. Motivated by
the mathematics, we also introduce a second concept, namely the differential
skew divergence (DSD), which is the skewed relative entropy differentiated
with respect to $-\log\alpha$ (not divided by it). The DSD can also be interpreted
as a skewed version of one of the quantum $\chi^2$-divergences introduced in [30],
namely the one induced by the logarithm. It turns out that the DSD shares
many properties with the SD, that these properties are more easily proven
for the DSD and that the corresponding properties of the SD follow almost
trivially by a simple averaging argument. Moreover, the DSD satisfies a certain
symmetry property which the SD does not, and which will turn out to be
crucial in applications.

1 A preliminary version of this work has already been presented at TQC-2011,
Madrid [1], but as we were then unaware of Lee's work the quantity came with
another name, namely 'telescopic relative entropy'. We now feel that quantum skew
divergence is a more informative name.

This paper can be subdivided roughly in two parts: the first part is a theo- 
retical study of the properties of the skew divergence, and the second part is 
on applications. The first part consists of four sections. In Sections [2] and [3] 
we give precise definitions for the skew divergence and differential skew diver- 
gence, respectively, and state and prove their basic properties. 

Section 4 is devoted to the more complicated continuity properties of SD
and DSD. These are properties that have no counterparts for the relative
entropy, as a direct consequence of the infinity problem. We show continuity
in two different senses: first, we show that states that are close in trace norm
distance are also close in SD. Secondly, we show that the SD and DSD are
also continuous with respect to perturbations of each of their arguments. The
proofs of the latter continuity results rely on a very technical result - with an
even more technical and highly non-trivial proof - about the derivative of the
operator logarithm, and this is presented in Section 5. We suggest that this
Section might be skipped on first reading.

In the second part of this paper we consider applications of the (differential)
skew divergence. The first application, as discussed in Section 6, is of course
as a dissimilarity measure between quantum states, as this was the original
purpose for introducing the skew divergence. Here we give a detailed overview
of the relative entropy's drawbacks and of the various proposals that have
been made in the literature and how the skew divergence fits in.

However, there is more to the skew divergence than just being another
dissimilarity measure. In Section 7 we note its close connection to the generalised
quantum Jensen-Shannon divergence (QJS), i.e. the Holevo information, and
by exploiting the sharp continuity estimates for the SD derived in this paper,
we obtain new continuity-type bounds for the QJS and the Holevo information
that in many cases improve on existing estimates from the literature.

Finally, in Section 8 we give a simple proof of the so-called Small Incremental
Mixing Conjecture that was postulated by Bravyi [7] and recently proven
by Van Acoleyen [31]. Our proof yields a better proportionality constant (2
instead of 9) and may yield additional insight into the more general 'mixing
problem' proposed by Lieb and Vershynina [21].




2 Quantum Skew Divergence 



In this section we introduce the quantum generalisation of the skew divergence 
(SD) and state and prove its basic properties. 

First, let us recall the definition of the relative entropy of two quantum
states $\rho$ and $\sigma$, both positive,

$$S(\rho\|\sigma) := \mathrm{Tr}\,\rho(\log\rho - \log\sigma). \qquad (1)$$

For non-normalised positive operators $A$ and $B$, one defines

$$S(A\|B) := \mathrm{Tr}\,A(\log A - \log B) - \mathrm{Tr}(A - B). \qquad (2)$$

For positive scalars $a$ and $b$, we will also write

$$S(a|b) := a(\log a - \log b) - (a - b). \qquad (3)$$

Strictly speaking, when $\sigma$ (or $B$) has a non-trivial kernel, the relative entropy is
no longer defined. However, when the supports of $\rho$ and $\sigma$ satisfy the condition
$\mathrm{supp}\,\rho \subseteq \mathrm{supp}\,\sigma$ one customarily adopts the convention that '$0^+ \log 0^+ = 0$'
and redefines the relative entropy as

$$S(\rho\|\sigma) := S(\rho|_\sigma \,\|\, \sigma|_\sigma),$$

$$S(A\|B) := S(A|_B \,\|\, B|_B),$$

where the symbol $A|_B$ denotes the restriction of $A$ to the support of $B$. When
$\mathrm{supp}\,\rho \not\subseteq \mathrm{supp}\,\sigma$ this redefinition is not possible and one says that the relative
entropy is infinite.

The quantum skew divergence is based on the functional $S(\rho\|\alpha\rho + (1-\alpha)\sigma)$,
or $S(A\|\alpha A + (1-\alpha)B)$ in the non-normalised case, where $\alpha$ is a scalar, with
$0 < \alpha < 1$. Since, for all such $\alpha$, $\mathrm{supp}(A) \subseteq \mathrm{supp}(\alpha A + (1-\alpha)B)$, no problem
of infinities arises. Henceforth, we will always write $S(A\|\alpha A + (1-\alpha)B)$,
whether for $A, B > 0$ or for $A, B \ge 0$. In the latter case this is to mean
$S(A|_{A+B} \,\|\, (\alpha A + (1-\alpha)B)|_{A+B})$ ($\mathrm{supp}(\alpha A + (1-\alpha)B) = \mathrm{supp}(A + B)$ is
independent of the value of $\alpha$ over $(0, 1)$).

Definition 1 For fixed $\alpha \in (0,1)$, the quantum $\alpha$-skew divergence between
states $\rho$ and $\sigma$ is given by

$$\mathrm{SD}_\alpha(\rho\|\sigma) := \frac{1}{-\log(\alpha)}\, S(\rho\|\alpha\rho + (1-\alpha)\sigma). \qquad (4)$$

Likewise, for non-normalised operators $A, B \ge 0$,

$$\mathrm{SD}_\alpha(A\|B) := \frac{1}{-\log(\alpha)}\, S(A\|\alpha A + (1-\alpha)B). \qquad (5)$$

We call $\alpha$ the skewing parameter.

The reason for incorporating the scale factor $1/(-\log\alpha)$ is to normalise the
range of the SD to the interval $[0, 1]$.

Theorem 1 For all states $\rho$ and $\sigma$ and $0 < \alpha < 1$,

$$0 \le \mathrm{SD}_\alpha(\rho\|\sigma) \le 1,$$

and $\mathrm{SD}_\alpha(\rho\|\sigma) = 1$ if and only if $\rho \perp \sigma$.

Recall that two quantum states are mutually orthogonal, denoted $\rho \perp \sigma$, iff
$\mathrm{Tr}\,\rho\sigma = 0$.

Proof. Let $\tau = \alpha\rho + (1-\alpha)\sigma$. By operator monotonicity of the logarithm, we
have

$$\log(\tau) = \log(\alpha\rho + (1-\alpha)\sigma) \ge \log(\alpha\rho),$$

and, therefore,

$$S(\rho\|\tau) = \mathrm{Tr}\,\rho(\log\rho - \log\tau) \le \mathrm{Tr}\,\rho(\log\rho - \log(\alpha\rho)) = -\log\alpha.$$

Thus, $S(\rho\|\tau)$ is bounded above by $-\log\alpha$, which is finite for $0 < \alpha < 1$. It
therefore makes perfect sense to normalise $S(\rho\|\tau)$ by dividing it by $-\log\alpha$,
producing a quantity that is always between 0 and 1.

The equality case was proven in [1]. □
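As an illustration of Definition 1 and Theorem 1, the following is a minimal numerical sketch, assuming NumPy is available. The function names (`log_on_support`, `relative_entropy`, `skew_divergence`, `random_state`) are hypothetical helpers introduced here, not notation from the paper. The logarithm is taken on the support only, which is one way of implementing the convention adopted above.

```python
import numpy as np

def log_on_support(h, tol=1e-12):
    """Logarithm of a positive semidefinite matrix, restricted to its support."""
    w, v = np.linalg.eigh(h)
    keep = w > tol                      # drop the kernel (convention 0 log 0 = 0)
    vs = v[:, keep]
    return (vs * np.log(w[keep])) @ vs.conj().T

def relative_entropy(rho, tau):
    """S(rho||tau) = Tr rho (log rho - log tau), assuming supp(rho) lies in supp(tau)."""
    diff = log_on_support(rho) - log_on_support(tau)
    return float(np.real(np.trace(rho @ diff)))

def skew_divergence(rho, sigma, alpha):
    """SD_alpha(rho||sigma) for density matrices and 0 < alpha < 1."""
    tau = alpha * rho + (1.0 - alpha) * sigma
    return relative_entropy(rho, tau) / (-np.log(alpha))

def random_state(dim, rng):
    """Hypothetical helper: a random full-rank density matrix."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    m = g @ g.conj().T
    return m / np.real(np.trace(m))
```

For two orthogonal pure states the sketch returns a value equal to 1 up to rounding, and for identical states it returns 0, in line with Theorem 1.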

The definition of the skew divergence for non-normalised operators is also
applicable to non-negative scalars. To distinguish the scalar case more clearly
from the matrix case we will use the symbol $\mathrm{SD}_\alpha(b|c)$ for scalars;

$$\mathrm{SD}_\alpha(b|c) = \frac{b(\log b - \log(\alpha b + (1-\alpha)c)) - (1-\alpha)(b - c)}{-\log\alpha}. \qquad (6)$$

As we do not restrict the arguments of the SD to normalised states, the
following scaling identities can be useful.

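Theorem 2 For $0 < \alpha < 1$, operators $X, Y \ge 0$, and positive scalars $b, c$,

$$\mathrm{SD}_\alpha(bX\|bY) = b\,\mathrm{SD}_\alpha(X\|Y) \qquad (7)$$

$$\mathrm{SD}_\alpha(bX\|cX) = \mathrm{SD}_\alpha(b|c)\,\mathrm{Tr}\,X. \qquad (8)$$

This is easy to prove by simple calculation.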
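The scaling identities can be checked numerically in the commuting case, where operators reduce to lists of eigenvalues. The sketch below is only an illustration under that assumption (all entries strictly positive); `sd_scalar` follows formula (6), and `sd_diag` is a hypothetical helper that sums the scalar expression over eigenvalue pairs.

```python
import math

def sd_scalar(b, c, alpha):
    """Scalar skew divergence, formula (6); assumes b > 0 and c >= 0."""
    m = alpha * b + (1.0 - alpha) * c
    numerator = b * (math.log(b) - math.log(m)) - (1.0 - alpha) * (b - c)
    return numerator / (-math.log(alpha))

def sd_diag(x, y, alpha):
    """SD for commuting operators given by eigenvalue lists (classical case)."""
    return sum(sd_scalar(xi, yi, alpha) for xi, yi in zip(x, y))

def scale(values, factor):
    """Multiply every eigenvalue by a positive scalar."""
    return [factor * v for v in values]
```

With `x = [0.2, 0.5, 0.3]` and `y = [0.6, 0.1, 0.3]`, one can compare `sd_diag(scale(x, 3), scale(y, 3), alpha)` with `3 * sd_diag(x, y, alpha)` for identity (7), and `sd_diag(scale(x, 2), scale(x, 5), alpha)` with `sd_scalar(2, 5, alpha) * sum(x)` for identity (8).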

The quantum skew divergence inherits many desirable properties from the 
relative entropy: 

Theorem 3 For $0 < \alpha < 1$, states $\rho$, $\sigma$, any unitary matrix $U$ and any
completely positive trace-preserving (CPTP) map $\Phi$,

(1) Positivity: $\mathrm{SD}_\alpha(\rho\|\sigma) \ge 0$, and $\mathrm{SD}_\alpha(\rho\|\sigma) = 0$ if and only if $\rho = \sigma$;

(2) Unitary invariance: $\mathrm{SD}_\alpha(U\rho U^*\|U\sigma U^*) = \mathrm{SD}_\alpha(\rho\|\sigma)$;

(3) Contractivity: $\mathrm{SD}_\alpha(\Phi(\rho)\|\Phi(\sigma)) \le \mathrm{SD}_\alpha(\rho\|\sigma)$;

(4) Joint convexity: the map $(\rho, \sigma) \mapsto \mathrm{SD}_\alpha(\rho\|\sigma)$ is jointly convex.

The proof is again straightforward. Note that these are the same properties
that the quantum Jensen-Shannon divergence obeys [23].

One property of the relative entropy that is lost by skewing is additivity under
taking tensor products. For the relative entropy we have

$$S(\rho_1 \otimes \rho_2 \| \sigma_1 \otimes \sigma_2) = S(\rho_1\|\sigma_1) + S(\rho_2\|\sigma_2).$$

In particular, for tensor powers $\rho^{\otimes n} := \rho \otimes \cdots \otimes \rho$,

$$S(\rho^{\otimes n}\|\sigma^{\otimes n}) = n\,S(\rho\|\sigma).$$

This property is clearly lost by skewing, as it would violate $\mathrm{SD}_\alpha(\rho\|\sigma) \le 1$.
Additivity still holds in the special case $\rho_2 = \sigma_2$; that is, for all $\rho$, $\sigma$, $\tau$,

$$\mathrm{SD}_\alpha(\rho \otimes \tau\|\sigma \otimes \tau) = \mathrm{SD}_\alpha(\rho\|\sigma).$$

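However, in the asymptotic regime ($n \to \infty$) a property emerges that is quite
close to additivity. Some care must be taken to obtain non-trivial limits. If we
let the skewing parameter $\alpha$ vary with $n$ as well, we obtain:

Theorem 4 For all states $\rho$ and $\sigma$, and all $0 < \alpha < 1$,

$$\lim_{n\to\infty} \mathrm{SD}_{\alpha^n}(\rho^{\otimes n}\|\sigma^{\otimes n}) \le \min\left(1, \frac{1}{-\log\alpha}\, S(\rho\|\sigma)\right). \qquad (9)$$

Proof. By the operator monotonicity of the logarithm,

$$\mathrm{Tr}\,\rho^{\otimes n} \log(\alpha^n \rho^{\otimes n} + (1-\alpha^n)\sigma^{\otimes n}) \ge \max\left(\mathrm{Tr}\,\rho^{\otimes n}\log(\alpha^n\rho^{\otimes n}),\ \mathrm{Tr}\,\rho^{\otimes n}\log((1-\alpha^n)\sigma^{\otimes n})\right),$$

therefore

$$S(\rho^{\otimes n}\|\alpha^n\rho^{\otimes n} + (1-\alpha^n)\sigma^{\otimes n}) \le \min\left(-n\log(\alpha),\ -\log(1-\alpha^n) + n\,S(\rho\|\sigma)\right).$$

In the limit $n \to \infty$,

$$\lim_{n\to\infty} \frac{1}{n}\, S(\rho^{\otimes n}\|\alpha^n\rho^{\otimes n} + (1-\alpha^n)\sigma^{\otimes n}) \le \min(-\log(\alpha),\ S(\rho\|\sigma)).$$

□

Numerical studies have led us to believe that in fact equality might always
hold in (9). For commuting, finite-dimensional $\rho$ and $\sigma$ this is not so hard to
prove. The general case, however, still evades us.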
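For commuting qubit states the quantities in Theorem 4 can be evaluated exactly at finite $n$, because the tensor powers reduce to binomial distributions. The sketch below is an illustration only. It assumes $\rho = \mathrm{Diag}(p, 1-p)$ and $\sigma = \mathrm{Diag}(q, 1-q)$ with $0 < p, q < 1$, and the function names are hypothetical. It compares $\mathrm{SD}_{\alpha^n}$ with the finite-$n$ bound that appears in the proof.

```python
import math

def kl_bernoulli(p, q):
    """Relative entropy S(rho||sigma) for rho = Diag(p, 1-p), sigma = Diag(q, 1-q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def sd_tensor_power(p, q, alpha, n):
    """SD with skewing parameter alpha**n between the n-fold tensor powers."""
    log_an = n * math.log(alpha)
    log_1m_an = math.log1p(-alpha ** n)
    total = 0.0
    for k in range(n + 1):
        log_pk = k * math.log(p) + (n - k) * math.log(1.0 - p)
        log_qk = k * math.log(q) + (n - k) * math.log(1.0 - q)
        # log(alpha^n P_k + (1 - alpha^n) Q_k), computed via log-sum-exp
        x, y = log_an + log_pk, log_1m_an + log_qk
        hi = max(x, y)
        log_mix = hi + math.log(math.exp(x - hi) + math.exp(y - hi))
        # multiplicity C(n, k) times eigenvalue P_k, assembled in log space
        log_weight = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1) + log_pk)
        total += math.exp(log_weight) * (log_pk - log_mix)
    return total / (-log_an)

def finite_n_bound(p, q, alpha, n):
    """min(1, (n S - log(1 - alpha^n)) / (-n log alpha)), from the proof of Theorem 4."""
    s = kl_bernoulli(p, q)
    return min(1.0, (n * s - math.log1p(-alpha ** n)) / (-n * math.log(alpha)))
```

Evaluating `sd_tensor_power(0.3, 0.6, 0.5, n)` for increasing `n` alongside `finite_n_bound` gives a feel for how close the two sides get; the sketch makes no claim about the limit beyond the inequality proven above.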



3 Differential Skew Divergence 



The proofs of the properties of SD are almost always made simpler by first
proving the corresponding properties for a related quantity, which we call the
differential skew divergence (DSD). Here, rather than dividing
$S(A\|\alpha A + (1-\alpha)B)$ by $-\log\alpha$, the derivative is taken w.r.t. $-\log\alpha$. We will see
that this quantity also has many uses in its own right. In particular, it proved to be
the essential concept in our proof of the Bravyi Conjecture reported in Section 8.

It is well-known that the relative entropy $S(A\|B)$ is differentiable w.r.t. $A$
and $B$ whenever $A, B > 0$. Hence, for $A, B > 0$, the function
$\alpha \mapsto S(A\|\alpha A + (1-\alpha)B)$ is differentiable over the open interval $(0, 1)$. For
$A, B \ge 0$ this is no longer true as the relative entropy is in general only lower
semicontinuous [31]. However, thanks to the convention adopted above to restrict
$A$ and $B$ to the support of $A + B$, the function $\alpha \mapsto S(A\|\alpha A + (1-\alpha)B)$ is still
differentiable for $A, B \ge 0$. Because of this, the following definition makes sense:

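Definition 2 For $A, B \ge 0$ and $0 < \alpha < 1$,

$$\mathrm{DSD}_\alpha(A\|B) := \frac{d}{d(-\log\alpha)}\, S(A\|\alpha A + (1-\alpha)B) \qquad (10)$$

$$= -\alpha \frac{d}{d\alpha}\, S(A\|\alpha A + (1-\alpha)B). \qquad (11)$$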

The skew divergence can be recovered from the differential one by a simple 
averaging procedure. 



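Theorem 5 For operators $A, B \ge 0$ and $0 < \alpha < 1$,

$$\mathrm{SD}_\alpha(A\|B) = \frac{1}{-\log\alpha} \int_0^{-\log\alpha} \mathrm{DSD}_{\alpha'}(A\|B)\, d(-\log\alpha'). \qquad (12)$$

Proof. Define the function $f(\alpha) = S(A\|\alpha A + (1-\alpha)B)$. By the substitution
$b = -\log\alpha$, we can write

$$\mathrm{SD}_\alpha(A\|B) = \frac{1}{b}\, f(\exp(-b)),$$

$$\mathrm{DSD}_\alpha(A\|B) = \frac{d}{db}\, f(\exp(-b)).$$

Therefore, as for $b = 0$, $f(\exp(-b)) = f(1) = S(A\|A) = 0$,

$$\mathrm{SD}_\alpha(A\|B) = \frac{1}{b} \int_0^b \frac{d}{db'}\, f(\exp(-b'))\, db' = \frac{1}{-\log\alpha} \int_0^{-\log\alpha} \mathrm{DSD}_{\alpha'}(A\|B)\, d(-\log\alpha'),$$

which is indeed an average w.r.t. $-\log\alpha$. □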
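The averaging relation (12) is easy to verify numerically for scalar arguments. The following sketch is an illustration only. It uses the scalar SD of formula (6) and the scalar DSD obtained by differentiating it, $\mathrm{DSD}_\alpha(b|c) = \alpha(1-\alpha)(b-c)^2/(\alpha b + (1-\alpha)c)$, and approximates the integral with a plain midpoint rule; the function names and the number of steps are arbitrary choices made here.

```python
import math

def sd_scalar(b, c, alpha):
    """Scalar skew divergence, formula (6); assumes b > 0 and c >= 0."""
    m = alpha * b + (1.0 - alpha) * c
    numerator = b * (math.log(b) - math.log(m)) - (1.0 - alpha) * (b - c)
    return numerator / (-math.log(alpha))

def dsd_scalar(b, c, alpha):
    """Scalar differential skew divergence: -alpha d/dalpha of S(b | alpha b + (1-alpha) c)."""
    return alpha * (1.0 - alpha) * (b - c) ** 2 / (alpha * b + (1.0 - alpha) * c)

def sd_by_averaging(b, c, alpha, steps=20000):
    """Midpoint rule for (1/beta) * int_0^beta DSD_{exp(-x)}(b|c) dx, beta = -log(alpha)."""
    beta = -math.log(alpha)
    h = beta / steps
    total = sum(dsd_scalar(b, c, math.exp(-(i + 0.5) * h)) for i in range(steps))
    return total * h / beta
```

For instance, `sd_by_averaging(0.7, 0.2, 0.3)` and `sd_scalar(0.7, 0.2, 0.3)` should agree to within the quadrature error.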

This is an important fact, because whenever one has an equality or inequality 
for the DSD where the parameter a uniquely appears as the skewness param- 
eter, one can immediately obtain the corresponding (in)equality for the SD by 
averaging over a suitable range of — log a. We will encounter applications of 
this technique in the proofs of Theorems El dH [151 and [TBJ 

Note that the DSD is unitarily invariant, just like the SD: for any unitary $U$,
$\mathrm{DSD}_\alpha(UAU^*\|UBU^*) = \mathrm{DSD}_\alpha(A\|B)$. Further properties are given in
Section 3.2.

Explicit formulas for the DSD can be given by working out the derivative. 
This requires taking the derivative of the operator logarithm. We first collect 
a number of known facts about this derivative. 



3.1 The operator logarithm and its derivative 



The following integral representation of the logarithm lies at the basis of much
of the subsequent treatment. For $x > 0$, we have

$$\log x = \int_0^\infty ds \left(\frac{1}{1+s} - \frac{1}{x+s}\right). \qquad (13)$$

Using functional calculus, this definition can be extended to the operator
logarithm. For $A > 0$,

$$\log A = \int_0^\infty ds \left(\frac{1}{1+s}\, I - (A + sI)^{-1}\right). \qquad (14)$$

From this representation follows a representation of the derivative of the
operator logarithm. As in [20], let us define for $A > 0$ the linear map
$\Delta \mapsto \mathcal{T}_A(\Delta)$ for self-adjoint $\Delta$ as

$$\mathcal{T}_A(\Delta) := \frac{d}{dt}\bigg|_{t=0} \log(A + t\Delta). \qquad (15)$$

From integral representation (13) we get an integral representation for $\mathcal{T}_A$ as
well:

$$\mathcal{T}_A(\Delta) = \int_0^\infty ds\, (A + sI)^{-1} \Delta (A + sI)^{-1}. \qquad (16)$$

For scalar arguments we clearly have

$$\mathcal{T}_a(x) = x/a. \qquad (17)$$

From (15) and the fact that $\log(A + tA) = \log(1 + t)\,I + \log A$ it follows that
$\mathcal{T}_A(A) = I$. From (16) it follows that, for any $A > 0$, $\mathcal{T}_A$ is a completely
positive map. In particular, it preserves the positive semidefinite order; that
is, if $X \le Y$, then $\mathcal{T}_A(X) \le \mathcal{T}_A(Y)$. Also, for $X \ge 0$, $\mathcal{T}_A(X) \ge 0$.
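For numerical work the map $\mathcal{T}_A$ need not be computed from the integral (16). In the eigenbasis of $A$ the $s$-integral can be carried out entry by entry, giving the first divided differences of the logarithm, $(\log a_i - \log a_j)/(a_i - a_j)$, with $1/a_i$ on the diagonal. The sketch below is an illustration under that standard identity, assuming NumPy; `log_derivative` and `logm_hermitian` are hypothetical names, and the degeneracy threshold is an arbitrary choice.

```python
import numpy as np

def logm_hermitian(h):
    """Matrix logarithm of a positive definite Hermitian matrix."""
    w, v = np.linalg.eigh(h)
    return (v * np.log(w)) @ v.conj().T

def log_derivative(a, delta):
    """T_A(Delta): derivative of log at A > 0 in the Hermitian direction Delta."""
    w, v = np.linalg.eigh(a)
    d = v.conj().T @ delta @ v                 # Delta in the eigenbasis of A
    lw = np.log(w)
    gap = w[:, None] - w[None, :]
    degenerate = np.abs(gap) < 1e-9 * np.max(np.abs(w))
    safe_gap = np.where(degenerate, 1.0, gap)
    # divided differences of log; equal eigenvalues give the ordinary derivative 1/a
    kernel = np.where(degenerate, 1.0 / w[:, None],
                      (lw[:, None] - lw[None, :]) / safe_gap)
    return v @ (kernel * d) @ v.conj().T
```

Two quick sanity checks are that `log_derivative(A, A)` returns the identity, as stated above, and that the result agrees with a central finite difference of `logm_hermitian`.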

The sesquilinear form

$$M_A(B, C) := \langle B^*, \mathcal{T}_A(C)\rangle = \mathrm{Tr}\,B^*\mathcal{T}_A(C) \qquad (18)$$

is a metric: it is positive semidefinite ($M_A(B, B) \ge 0$ for any $B$), $M_A(B, B) = 0$
iff $B = 0$, and $M_A(B, B)$ is continuous in $B$ for any $A$. As it is contractive
under CPTP maps $\Phi$,

$$M_{\Phi(A)}(\Phi(B), \Phi(B)) \le M_A(B, B)$$

for any $A > 0$ and any $B$, it is a monotone metric [19,26]. This particular
sesquilinear form is closely related to the Kubo-Mori scalar product from
quantum statistical mechanics, which is the sesquilinear form $\langle B^*, \mathcal{T}_A^{-1}(C)\rangle$.
Lieb has shown that the map $(A, B) \mapsto M_A(B, B)$, for $A > 0$ and any $B$, is
jointly convex in $A$ and $B$ ([20], Theorem 3).



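Lemma 1 Let $A, B, C \ge 0$ with $\mathrm{supp}\,A \subseteq \mathrm{supp}\,B$. Then, irrespective of
$\mathrm{supp}\,C$,

$$\lim_{\epsilon\to 0} M_{B+\epsilon C}(A, A) = M_{B|_B}(A|_B, A|_B).$$

Proof. To begin with, we can restrict $A$ and $B$ in the left-hand side to
$\mathrm{supp}(B + C)$. Now let $P$ be the projector on $\mathrm{supp}\,B$ and $Q$ the projector on the
orthogonal complement of $\mathrm{supp}\,B$ (within $\mathrm{supp}(B + C)$). Consider the $2 \times 2$
partitioning induced by $P$ and $Q$:

$$A \mapsto \begin{pmatrix} PAP^* & PAQ^* \\ QAP^* & QAQ^* \end{pmatrix}$$

and similarly for all other operators. Because of the conditions on the supports,
we have $PAQ^* = QAP^* = QAQ^* = 0$ and $PBQ^* = QBP^* = QBQ^* = 0$.
Hence,

$$\mathrm{Tr}\,A\,\mathcal{T}_{B+\epsilon C}(A) = \int_0^\infty ds\, \mathrm{Tr}\,A(B + \epsilon C + s)^{-1}A(B + \epsilon C + s)^{-1}$$

$$= \int_0^\infty ds\, \mathrm{Tr}\,(PAP^*)\,(P(B + \epsilon C + s)^{-1}P^*)\,(PAP^*)\,(P(B + \epsilon C + s)^{-1}P^*).$$

Using Schur complements, we can find the explicit expression

$$P(B + \epsilon C + s)^{-1}P^* = \left((PBP^* + \epsilon PCP^* + s) - \epsilon^2\, PCQ^*(\epsilon QCQ^* + s)^{-1}QCP^*\right)^{-1}.$$

In the limit $\epsilon \to 0$, this simplifies as

$$\lim_{\epsilon\to 0} P(B + \epsilon C + s)^{-1}P^* = (PBP^* + s)^{-1},$$

since all operator blocks appearing here are invertible. Therefore,

$$\lim_{\epsilon\to 0} \mathrm{Tr}\,A\,\mathcal{T}_{B+\epsilon C}(A) = \int_0^\infty ds\, \mathrm{Tr}\,(PAP^*)(PBP^* + s)^{-1}(PAP^*)(PBP^* + s)^{-1}$$

$$= \mathrm{Tr}\,A|_B\,\mathcal{T}_{B|_B}(A|_B).$$

□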

3.2 Properties of the DSD 

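We can now state the promised explicit formulas for the DSD. For $0 < \alpha < 1$,
and operators $A, B > 0$,

$$\mathrm{DSD}_\alpha(A\|B) = \alpha\left(\mathrm{Tr}\,A\,\mathcal{T}_{\alpha A+(1-\alpha)B}(A - B) - \mathrm{Tr}(A - B)\right) \qquad (19)$$

$$= \alpha(1-\alpha)\,\mathrm{Tr}\,(A - B)\,\mathcal{T}_{\alpha A+(1-\alpha)B}(A - B) \qquad (20)$$

$$= \frac{\alpha}{1-\alpha}\,\mathrm{Tr}\,A\,\mathcal{T}_{\alpha A+(1-\alpha)B}(A) - \frac{\alpha}{1-\alpha}\,\mathrm{Tr}\,A - \alpha\,\mathrm{Tr}(A - B). \qquad (21)$$

For the general case $A, B \ge 0$, note that, for $0 < \alpha < 1$, we always have
$\mathrm{supp}\,A, \mathrm{supp}\,B \subseteq \mathrm{supp}(\alpha A + (1-\alpha)B)$, and again we avoid the problem of
infinities. Moreover, we can still use formulas (19)-(21) if as before we restrict
$A$ and $B$ to $S = \mathrm{supp}(A + B)$.

We denote the DSD for scalar arguments by $\mathrm{DSD}_\alpha(b|c)$. By (20) and (21),
explicit formulas are

$$\mathrm{DSD}_\alpha(b|c) = \alpha(1-\alpha)\,\frac{(b - c)^2}{\alpha b + (1-\alpha)c} \qquad (22)$$

$$= \frac{\alpha}{1-\alpha}\left(\frac{b^2}{\alpha b + (1-\alpha)c} - b\right) - \alpha(b - c). \qquad (23)$$

In particular,

$$\mathrm{DSD}_\alpha(b|0) = (1-\alpha)b, \qquad \mathrm{DSD}_\alpha(0|c) = \alpha c. \qquad (24)$$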

The main reason for introducing the differential form of the SD is that the
DSD satisfies a very useful symmetry property.

Theorem 6 For $A, B \ge 0$, and $0 < \alpha < 1$,

$$\mathrm{DSD}_\alpha(A\|B) = \mathrm{DSD}_{1-\alpha}(B\|A). \qquad (25)$$

Proof. This follows immediately from formula (20). □

From formula (20) it follows that the DSD can be expressed in terms of the
metric $M$ as

$$\mathrm{DSD}_\alpha(A\|B) = \alpha(1-\alpha)\,M_{\alpha A+(1-\alpha)B}(A - B, A - B). \qquad (26)$$

Thus, it follows that the DSD is positive and contractive under CPTP maps.
For example, applying the trace map, with $a = \mathrm{Tr}\,A$ and $b = \mathrm{Tr}\,B$, we have:

$$\mathrm{DSD}_\alpha(A\|B) \ge \mathrm{DSD}_\alpha(a|b). \qquad (27)$$

The DSD is a skewed version of one of the so-called quantum $\chi^2$-divergences
introduced by Temme et al [30], namely the one induced by the logarithm.
This logarithmic quantum $\chi^2$-divergence is defined for $A, B > 0$ as

$$\chi^2_{\log}(A, B) := M_B(A - B, A - B) = \mathrm{Tr}\,(A - B)\,\mathcal{T}_B(A - B).$$

A short calculation reveals that

$$\mathrm{DSD}_\alpha(A\|B) = \frac{\alpha}{1-\alpha}\,\chi^2_{\log}(A, \alpha A + (1-\alpha)B). \qquad (28)$$

This means that certain properties that were proven in [30] for the quantum
$\chi^2$-divergences carry over to the DSD. One such property is the following lower
bound on the DSD in terms of the trace norm distance $T(\rho, \sigma) := \frac{1}{2}\|\rho - \sigma\|_1$:

Theorem 7 For all density operators $\rho$ and $\sigma$ and any $0 < \alpha < 1$,

$$\mathrm{DSD}_\alpha(\rho\|\sigma) \ge 4\alpha(1-\alpha)\,T(\rho, \sigma)^2. \qquad (29)$$

Proof. This follows from Lemma 5 in [30] according to which
$\chi^2(\rho, \sigma) \ge \|\rho - \sigma\|_1^2$. With the substitution $\sigma \to \tau := \alpha\rho + (1-\alpha)\sigma$ and noting
that $\rho - \tau = (1-\alpha)(\rho - \sigma)$, the inequality follows. □

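Our next result provides an upper bound on the DSD in terms of the trace
norm distance.

Theorem 8 For density operators $\rho, \sigma \ge 0$ and $0 < \alpha < 1$,

$$\mathrm{DSD}_\alpha(\rho\|\sigma) \le T(\rho, \sigma). \qquad (30)$$

Proof. For any self-adjoint operator $X$, let $X_+$ and $X_-$ denote the positive
part $X_+ = (X + |X|)/2$ and negative part $X_- = (|X| - X)/2$. Then another
expression for the trace norm distance is

$$T(\rho, \sigma) = \mathrm{Tr}\,(\rho - \sigma)_+ = \mathrm{Tr}\,(\rho - \sigma)_-.$$

We use formula (19) and note that $\mathrm{Tr}(\rho - \sigma) = 0$:

$$\mathrm{DSD}_\alpha(\rho\|\sigma) = \alpha\,\mathrm{Tr}\,\rho\,\mathcal{T}_{\alpha\rho+(1-\alpha)\sigma}(\rho - \sigma)$$

$$= \alpha\,\mathrm{Tr}\,(\rho - \sigma)\,\mathcal{T}_{\alpha\rho+(1-\alpha)\sigma}(\rho)$$

$$\le \alpha\,\mathrm{Tr}\,(\rho - \sigma)_+\,\mathcal{T}_{\alpha\rho+(1-\alpha)\sigma}(\rho)$$

$$\le \mathrm{Tr}\,(\rho - \sigma)_+\,\mathcal{T}_{\alpha\rho+(1-\alpha)\sigma}(\alpha\rho + (1-\alpha)\sigma)$$

$$= \mathrm{Tr}\,(\rho - \sigma)_+.$$

□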
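The explicit formula (20), the symmetry (25) and the two trace-distance bounds (29) and (30) can be checked numerically. The sketch below is a hedged illustration for full-rank states, assuming NumPy. It evaluates the derivative of the logarithm through divided differences in the eigenbasis, which is a standard identity used here as an implementation choice, and all function names are hypothetical.

```python
import numpy as np

def logm_hermitian(h):
    """Matrix logarithm of a positive definite Hermitian matrix."""
    w, v = np.linalg.eigh(h)
    return (v * np.log(w)) @ v.conj().T

def log_derivative(a, delta):
    """T_A(Delta) via first divided differences of log in the eigenbasis of A."""
    w, v = np.linalg.eigh(a)
    d = v.conj().T @ delta @ v
    lw = np.log(w)
    gap = w[:, None] - w[None, :]
    degenerate = np.abs(gap) < 1e-9 * np.max(np.abs(w))
    kernel = np.where(degenerate, 1.0 / w[:, None],
                      (lw[:, None] - lw[None, :]) / np.where(degenerate, 1.0, gap))
    return v @ (kernel * d) @ v.conj().T

def dsd(rho, sigma, alpha):
    """DSD_alpha(rho||sigma) from formula (20)."""
    mix = alpha * rho + (1.0 - alpha) * sigma
    diff = rho - sigma
    return float(alpha * (1.0 - alpha) * np.real(np.trace(diff @ log_derivative(mix, diff))))

def skewed_entropy(rho, sigma, alpha):
    """S(rho || alpha rho + (1 - alpha) sigma) for full-rank density matrices."""
    mix = alpha * rho + (1.0 - alpha) * sigma
    return float(np.real(np.trace(rho @ (logm_hermitian(rho) - logm_hermitian(mix)))))

def trace_distance(rho, sigma):
    """T(rho, sigma) = half the trace norm of rho - sigma."""
    return 0.5 * float(np.sum(np.abs(np.linalg.eigvalsh(rho - sigma))))

def random_state(dim, rng):
    """Hypothetical helper: a random full-rank density matrix."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    m = g @ g.conj().T
    return m / np.real(np.trace(m))
```

A central finite difference of `skewed_entropy` in `alpha`, multiplied by `-alpha`, should reproduce `dsd`, which ties formula (20) back to definition (11).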
Using the averaging procedure, Theorem 5, we immediately get

Theorem 9 For density operators $\rho, \sigma \ge 0$ and $0 < \alpha < 1$,

$$\frac{2(1-\alpha)^2}{-\log(\alpha)}\,T(\rho, \sigma)^2 \le \mathrm{SD}_\alpha(\rho\|\sigma) \le T(\rho, \sigma). \qquad (31)$$

To prove the lower bound we note that using (12) the factor $4\alpha(1-\alpha)$ averages
to $2(1-\alpha)^2/(-\log(\alpha))$.

The upper bound shows that two states that are close in trace norm distance
are also close in terms of the skew divergence. In contrast, there is no
meaningful upper bound on the relative entropy in terms of the trace norm distance
unless the smallest eigenvalues of both states are being controlled [3].

Despite the very simple form of the upper bound, it is the strongest one
possible. Equality can be obtained for any value of $t = T(\rho, \sigma)$ for states in
dimension 3 (and higher), for example by choosing $\rho = \mathrm{Diag}(t, 0, 1-t)$ and
$\sigma = \mathrm{Diag}(0, t, 1-t)$.
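The equality case can be verified directly, since both states are diagonal and the skew divergence reduces to a classical sum. The short sketch below is an illustration with hypothetical function names; it evaluates $\mathrm{SD}_\alpha$ and the trace norm distance for $\rho = \mathrm{Diag}(t, 0, 1-t)$ and $\sigma = \mathrm{Diag}(0, t, 1-t)$.

```python
import math

def sd_classical(p, q, alpha):
    """SD_alpha for commuting density operators given by their eigenvalue lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:                    # convention 0 log 0 = 0
            total += pi * (math.log(pi) - math.log(alpha * pi + (1.0 - alpha) * qi))
    return total / (-math.log(alpha))

def trace_distance_classical(p, q):
    """T(p, q) = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def equality_example(t):
    """The pair of diagonal states from the text, for 0 <= t <= 1."""
    return [t, 0.0, 1.0 - t], [0.0, t, 1.0 - t]
```

Only the first eigenvalue contributes to the sum, giving $t(\log t - \log(\alpha t)) = -t\log\alpha$, so that $\mathrm{SD}_\alpha = t = T(\rho, \sigma)$ for every $\alpha$.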



4 Continuity Properties of SD and DSD 



While the SD inherits many properties from the relative entropy, and improves
on some of its undesired properties, just like the relative entropy it does not
satisfy a triangle inequality: in general,
$\mathrm{SD}_\alpha(\rho\|\tau) \not\le \mathrm{SD}_\alpha(\rho\|\sigma) + \mathrm{SD}_\alpha(\sigma\|\tau)$. However,
due to the smoothing effects of skewing, two continuity inequalities (in the
sense of Fannes) can be proven that at least come close in spirit to a triangle
inequality. To wit, we prove two bounds on the variation of the SD in terms of
variations of either of its arguments as measured by the trace norm distance.
Again no such bounds are possible for the relative entropy without further
restrictions on the spectra of its arguments.

The centrepiece in the proofs of these bounds is the following proposition,
which is a kind of continuity bound for the $M$-map:

Proposition 1 For $A, B, C \ge 0$, with $a = \mathrm{Tr}\,A$ and $c = \mathrm{Tr}\,C$,

$$0 \le M_{A+B}(A, A) - M_{A+B+C}(A, A) \le M_a(a, a) - M_{a+c}(a, a) = \frac{ac}{a + c}. \qquad (32)$$

The proof of this proposition is postponed to the next section. Note that the
stronger inequality $M_{A+B}(A, A) - M_{A+B+C}(A, A) \le M_{a+b}(a, a) - M_{a+b+c}(a, a)$
(with $b = \mathrm{Tr}\,B$) does not always hold.
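Before turning to the consequences, it can be reassuring to test the proposition numerically on random positive definite matrices. The sketch below is an illustration only, assuming NumPy and real symmetric inputs; the helper names are hypothetical and the derivative of the logarithm is again evaluated through divided differences in the eigenbasis.

```python
import numpy as np

def log_derivative(a, delta):
    """T_A(Delta) via first divided differences of log in the eigenbasis of A."""
    w, v = np.linalg.eigh(a)
    d = v.conj().T @ delta @ v
    lw = np.log(w)
    gap = w[:, None] - w[None, :]
    degenerate = np.abs(gap) < 1e-9 * np.max(np.abs(w))
    kernel = np.where(degenerate, 1.0 / w[:, None],
                      (lw[:, None] - lw[None, :]) / np.where(degenerate, 1.0, gap))
    return v @ (kernel * d) @ v.conj().T

def m_form(base, x):
    """M_base(x, x) = Tr x T_base(x) for Hermitian x and base > 0."""
    return float(np.real(np.trace(x @ log_derivative(base, x))))

def proposition1_gap(A, B, C):
    """The middle quantity of (32): M_{A+B}(A, A) - M_{A+B+C}(A, A)."""
    return m_form(A + B, A) - m_form(A + B + C, A)

def random_positive(dim, rng):
    """Hypothetical helper: a random real positive definite matrix."""
    g = rng.normal(size=(dim, dim))
    return g @ g.T + 0.1 * np.eye(dim)
```

For each random triple the value of `proposition1_gap(A, B, C)` should lie between 0 and `a * c / (a + c)`, with `a = np.trace(A)` and `c = np.trace(C)`.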

The inequalities of this proposition lead to several inequalities for the DSD. 

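Theorem 10 For $A, B, C \ge 0$ and $0 < \alpha < 1$, with $a = \mathrm{Tr}\,A$ and $c = \mathrm{Tr}\,C$,

$$-\mathrm{DSD}_\alpha(0|c) \le \mathrm{DSD}_\alpha(A\|B) - \mathrm{DSD}_\alpha(A\|B + C) \le \mathrm{DSD}_\alpha(a|0) - \mathrm{DSD}_\alpha(a|c). \qquad (33)$$

Proof. Consider first the case $A, B, C > 0$. The inequality then follows from
Proposition 1 and expressions (21) and (23). We have

$$\mathrm{DSD}_\alpha(A\|B) - \mathrm{DSD}_\alpha(A\|B + C)$$

$$= \frac{\alpha}{1-\alpha}\left(\mathrm{Tr}\,A\,\mathcal{T}_{\alpha A+(1-\alpha)B}(A) - \mathrm{Tr}\,A\,\mathcal{T}_{\alpha A+(1-\alpha)(B+C)}(A)\right) - \alpha\,\mathrm{Tr}\,C$$

$$= \frac{1}{1-\alpha}\left(\mathrm{Tr}\,A\,\mathcal{T}_{A+\frac{1-\alpha}{\alpha}B}(A) - \mathrm{Tr}\,A\,\mathcal{T}_{A+\frac{1-\alpha}{\alpha}B+\frac{1-\alpha}{\alpha}C}(A)\right) - \alpha\,\mathrm{Tr}\,C,$$

so that

$$\mathrm{DSD}_\alpha(A\|B) - \mathrm{DSD}_\alpha(A\|B + C) \ge -\alpha c = -\mathrm{DSD}_\alpha(0|c),$$

and

$$\mathrm{DSD}_\alpha(A\|B) - \mathrm{DSD}_\alpha(A\|B + C) \le \frac{\alpha}{1-\alpha}\left(\frac{a^2}{\alpha a} - \frac{a^2}{\alpha a + (1-\alpha)c}\right) - \alpha c$$

$$= \mathrm{DSD}_\alpha(a|0) - \mathrm{DSD}_\alpha(a|c).$$

To treat the case $A, B, C \ge 0$ we first use Lemma 1 to bring both terms on a
'common denominator' as far as supports are concerned. Whereas $\mathrm{DSD}_\alpha(A\|B)$
is defined as $\mathrm{DSD}_\alpha(A|_{A+B}\,\|\,B|_{A+B})$, and in the second term the operators are
restricted to the potentially larger $\mathrm{supp}(A + B + C)$, we can write

$$\mathrm{DSD}_\alpha(A\|B) - \mathrm{DSD}_\alpha(A\|B + C) = \lim_{\epsilon\to 0} \mathrm{DSD}_\alpha(A\|B + \epsilon C) - \mathrm{DSD}_\alpha(A\|B + C),$$

in which the operators in both terms are now restricted to the support of
$A + B + C$, allowing us to use the positive case, as before. □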

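Theorem 11 For $A, B, C \ge 0$ and $0 < \alpha < 1$, with $a = \mathrm{Tr}\,A$ and $c = \mathrm{Tr}\,C$,

$$0 \le \mathrm{DSD}_\alpha(B\|A + B) - \mathrm{DSD}_\alpha(B + C\|A + B + C) \qquad (34)$$

$$\le \mathrm{DSD}_\alpha(0|a) - \mathrm{DSD}_\alpha(c|a + c). \qquad (35)$$

Proof. We use the expression (26), and the substitution $A' = (1-\alpha)A$, with
$a' = \mathrm{Tr}\,A'$:

$$\mathrm{DSD}_\alpha(B\|A + B) - \mathrm{DSD}_\alpha(B + C\|A + B + C)$$

$$= \alpha(1-\alpha)\left(M_{\alpha B+(1-\alpha)(A+B)}(A, A) - M_{\alpha(B+C)+(1-\alpha)(A+B+C)}(A, A)\right)$$

$$= \frac{\alpha}{1-\alpha}\left(M_{A'+B}(A', A') - M_{A'+B+C}(A', A')\right)$$

$$\le \frac{\alpha}{1-\alpha}\left(M_{a'}(a', a') - M_{a'+c}(a', a')\right)$$

$$= \mathrm{DSD}_\alpha(0|a) - \mathrm{DSD}_\alpha(c|a + c).$$

□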
This immediately yields: 
Proposition 2 For operators $A, B, C \ge 0$, with $a = \mathrm{Tr}\,A$ and $c = \mathrm{Tr}\,C$,

$$0 \le \mathrm{SD}_\alpha(B\|A + B) - \mathrm{SD}_\alpha(B + C\|A + B + C) \le \mathrm{SD}_\alpha(0|a) - \mathrm{SD}_\alpha(c|a + c) \qquad (36)$$

$$0 \le S(B\|A + B) - S(B + C\|A + B + C) \le S(0|a) - S(c|a + c). \qquad (37)$$

Proof. Inequalities (36) follow by averaging those of (35).
Then note that, with $A' = (1-\alpha)A$,

$$(-\log\alpha)\,\mathrm{SD}_\alpha(B\|B + A) = S(B\|\alpha B + (1-\alpha)(B + A))$$

$$= S(B\|B + (1-\alpha)A) = S(B\|B + A').$$

Doing this for all the terms in (36) and dropping primes yields (37). □
Proposition 3 For $A, B, C \ge 0$, with $a = \mathrm{Tr}\,A$ and $c = \mathrm{Tr}\,C$,

$$-\mathrm{SD}_\alpha(0|c) \le \mathrm{SD}_\alpha(A\|A + B) - \mathrm{SD}_\alpha(A\|A + B + C) \le -\mathrm{SD}_\alpha(a|a + c) \qquad (38)$$

$$-S(0|c) \le S(A\|A + B) - S(A\|A + B + C) \le -S(a|a + c). \qquad (39)$$

Proof. Consider the inequalities for the DSD of Theorem 10 with $B$ replaced by
$A + B$. The proof then proceeds completely similarly as the proof of Proposition 2. □

From these theorems, it is easy to derive continuity properties for the DSD 
and the SD. 

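Theorem 12 For $A, B_1, B_2 \ge 0$ and $0 < \alpha < 1$, with $a = \mathrm{Tr}\,A$,

$$\mathrm{DSD}_\alpha(A\|B_1) - \mathrm{DSD}_\alpha(A\|B_2) \le \mathrm{DSD}_\alpha(a|0) - \mathrm{DSD}_\alpha(a|\mathrm{Tr}(B_2 - B_1)_+) + \mathrm{DSD}_\alpha(0|\mathrm{Tr}(B_2 - B_1)_-).$$

Proof. A successive application of the first and then the second inequality of
Theorem 10 yields

$$\mathrm{DSD}_\alpha(A\|B_1) - \mathrm{DSD}_\alpha(A\|B_2)$$

$$= \mathrm{DSD}_\alpha(A\|B_1) - \mathrm{DSD}_\alpha(A\|B_1 + (B_2 - B_1)_+ - (B_2 - B_1)_-)$$

$$\le \mathrm{DSD}_\alpha(A\|B_1) - \mathrm{DSD}_\alpha(A\|B_1 + (B_2 - B_1)_+) + \mathrm{DSD}_\alpha(0|\mathrm{Tr}(B_2 - B_1)_-)$$

$$\le \mathrm{DSD}_\alpha(\mathrm{Tr}\,A|0) - \mathrm{DSD}_\alpha(\mathrm{Tr}\,A|\mathrm{Tr}(B_2 - B_1)_+) + \mathrm{DSD}_\alpha(0|\mathrm{Tr}(B_2 - B_1)_-).$$

□

Theorem 13 For $A_1, A_2, B \ge 0$ and $0 < \alpha < 1$, with $b = \mathrm{Tr}\,B$,

$$\mathrm{DSD}_\alpha(A_1\|B) - \mathrm{DSD}_\alpha(A_2\|B) \le \mathrm{DSD}_\alpha(0|b) - \mathrm{DSD}_\alpha(\mathrm{Tr}(A_2 - A_1)_+|b) + \mathrm{DSD}_\alpha(\mathrm{Tr}(A_2 - A_1)_-|0).$$

Proof. This follows immediately from Theorem 12 by the symmetry property
of DSD, Theorem 6. □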

Again, using the averaging procedure, and specialising to normalised $A$ and
$B$, we get the corresponding continuity properties for the skew divergence.
Due to the symmetry under exchanging $\rho_1$ and $\rho_2$, an absolute value sign can
be inserted.

Theorem 14 For $0 < \alpha < 1$, and for states $\rho_1$, $\rho_2$, $\sigma$ such that $T(\rho_1, \rho_2) = t$,

$$|\mathrm{SD}_\alpha(\rho_1\|\sigma) - \mathrm{SD}_\alpha(\rho_2\|\sigma)| \le \mathrm{SD}_\alpha(0|1) - \mathrm{SD}_\alpha(t|1) + \mathrm{SD}_\alpha(t|0) \qquad (40)$$

$$= \frac{t\left(\log(\alpha t + 1 - \alpha) - \log(\alpha t)\right)}{-\log(\alpha)}. \qquad (41)$$

It can be checked that the right-hand side is a concave and monotonically
increasing function of $t$ for any $0 < \alpha < 1$. It is also easily verified that
equality is achieved for $\rho_1 \perp \sigma$ and $\rho_2 = t\sigma + (1-t)\rho_1$.

For the relative entropy no such bound is possible, as can be seen by taking
two different pure states for $\rho$ and $\sigma_1$, and a mixed state for $\sigma_2$: for such a
choice the difference $|S(\rho_1\|\sigma) - S(\rho_2\|\sigma)|$ becomes infinite.

The second continuity inequality is w.r.t. the second argument: 

Theorem 15 For $0 < \alpha < 1$, and for states $\rho$, $\sigma_1$, $\sigma_2$ such that $T(\sigma_1, \sigma_2) = t$,

$$|\mathrm{SD}_\alpha(\rho\|\sigma_1) - \mathrm{SD}_\alpha(\rho\|\sigma_2)| \le \mathrm{SD}_\alpha(1|0) - \mathrm{SD}_\alpha(1|t) + \mathrm{SD}_\alpha(0|t) \qquad (42)$$

$$= 1 - \frac{-\log(\alpha + (1-\alpha)t)}{-\log(\alpha)}. \qquad (43)$$

When $\rho \perp \sigma_1$ and $\sigma_2 = t\rho + (1-t)\sigma_1$, equality is achieved. This shows that
the inequality is sharp, for any $\alpha$ and $t$.

Again, for the relative entropy no such bound is possible, as can be seen by
taking two different pure states for $\rho$ and $\sigma_1$, and a mixed state for $\sigma_2$.

Note that the right-hand side is also a concave and monotonically increasing
function of $t$ for all $0 < \alpha < 1$. For small $t$, the bound can be approximated
by the linear upper bound

$$1 - \frac{-\log(\alpha + (1-\alpha)t)}{-\log(\alpha)} \le \frac{1-\alpha}{-\alpha\log\alpha}\, t.$$

The coefficient $(1-\alpha)/(-\alpha\log\alpha)$ is always greater than or equal to 1. It tends
to $+\infty$ in the limit $\alpha \to 0$ and to 1 in the limit $\alpha \to 1$.
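The closed forms of the two continuity bounds follow from the scalar formula (6) by direct substitution, and this is easy to confirm numerically. The sketch below is an illustration with hypothetical function names. It evaluates the scalar combinations on the right-hand sides of (40) and (42), and compares them with $t(\log(\alpha t + 1 - \alpha) - \log(\alpha t))/(-\log\alpha)$ and $(\log(\alpha + (1-\alpha)t) - \log\alpha)/(-\log\alpha)$ respectively, the latter being the right-hand side of (42) worked out from (6).

```python
import math

def sd_scalar(b, c, alpha):
    """Scalar skew divergence, formula (6), with the convention 0 log 0 = 0."""
    m = alpha * b + (1.0 - alpha) * c
    entropic = b * (math.log(b) - math.log(m)) if b > 0.0 else 0.0
    return (entropic - (1.0 - alpha) * (b - c)) / (-math.log(alpha))

def bound_first_argument(t, alpha):
    """Closed form (41) of the bound of Theorem 14, for 0 < t <= 1."""
    return t * (math.log(alpha * t + 1.0 - alpha) - math.log(alpha * t)) / (-math.log(alpha))

def bound_second_argument(t, alpha):
    """Closed form of the bound of Theorem 15, derived from (42) and (6)."""
    return (math.log(alpha + (1.0 - alpha) * t) - math.log(alpha)) / (-math.log(alpha))

def linear_coefficient(alpha):
    """Slope (1 - alpha) / (-alpha log alpha) of the linear upper bound."""
    return (1.0 - alpha) / (-alpha * math.log(alpha))
```

Both bounds equal 1 at $t = 1$, consistent with the fact that the skew divergence itself never exceeds 1.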
5 Proof of Proposition 1



In this section we will present the most technical proof of the whole paper,
namely the proof of Proposition 1. As it relies on properties of the second
derivative of the operator logarithm, we first collect these here, and prove
some elementary lemmas about it.

Having defined the linear operator $\mathcal{T}$ via the first derivative of the logarithm in
Section 3.1, we can also define a quadratic operator $\mathcal{R}$ via the second derivative
[20]. For $A > 0$ and $\Delta$ self-adjoint,

$$\mathcal{R}_A(\Delta) := -\frac{d^2}{dt^2}\bigg|_{t=0} \log(A + t\Delta). \qquad (44)$$

A simple calculation using the integral representation of the first derivative
yields the integral representation

$$\mathcal{R}_A(\Delta) = 2\int_0^\infty ds\, (A + sI)^{-1}\Delta(A + sI)^{-1}\Delta(A + sI)^{-1}. \qquad (45)$$

Here we have used the fact that

$$\frac{d}{dt}(A + t\Delta)^{-1} = -(A + t\Delta)^{-1}\Delta(A + t\Delta)^{-1}.$$

One can similarly define a bilinear form, for $A > 0$ and self-adjoint $\Delta_1$ and
$\Delta_2$:

$$\mathcal{R}_A(\Delta_1, \Delta_2) := -\frac{\partial^2}{\partial t_1\,\partial t_2}\bigg|_{t_1=t_2=0} \log(A + t_1\Delta_1 + t_2\Delta_2) \qquad (46)$$

$$= \int_0^\infty ds\, (A + sI)^{-1}\Delta_1(A + sI)^{-1}\Delta_2(A + sI)^{-1} + \int_0^\infty ds\, (A + sI)^{-1}\Delta_2(A + sI)^{-1}\Delta_1(A + sI)^{-1}. \qquad (47)$$

Clearly,

$$\mathcal{R}_A(\Delta, \Delta) = \mathcal{R}_A(\Delta), \qquad (48)$$

$$\mathcal{R}_A(\Delta_1, \Delta_2) = \mathcal{R}_A(\Delta_2, \Delta_1), \qquad (49)$$

$$\mathrm{Tr}\,\Delta_0\,\mathcal{R}_A(\Delta_1, \Delta_2) = \mathrm{Tr}\,\Delta_2\,\mathcal{R}_A(\Delta_0, \Delta_1). \qquad (50)$$



Just as for $\mathcal{T}_A$ we have

Lemma 2 For $A > 0$, $\mathcal{R}_A(A) = I$.

Proof. Noting that $\log(A + tA) = \log(1 + t)\,I + \log A$, we have

$$\mathcal{R}_A(A) = -\frac{d^2}{dt^2}\bigg|_{t=0} \log(A + tA) = I.$$

□



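Lemma 3 For $A > 0$, $\mathcal{R}_A(A, \Delta) = \mathcal{T}_A(\Delta)$.

Proof.

$$\mathcal{R}_A(A, \Delta) = -\frac{\partial^2}{\partial t_1\,\partial t_2}\bigg|_{t_1=t_2=0} \log(A + t_1 A + t_2\Delta)$$

$$= -\frac{\partial^2}{\partial t_1\,\partial t_2}\bigg|_{t_1=t_2=0} \left[\log(1 + t_1)\,I + \log\left(A + \frac{t_2}{1 + t_1}\Delta\right)\right]$$

$$= -\frac{d}{dt_1}\bigg|_{t_1=0}\, \frac{d}{dt_2}\bigg|_{t_2=0} \log\left(A + \frac{t_2}{1 + t_1}\Delta\right).$$

The derivative w.r.t. $t_2$ is, with $u = t_2/(1 + t_1)$,

$$\frac{d}{dt_2}\bigg|_{t_2=0} \log\left(A + \frac{t_2}{1 + t_1}\Delta\right) = \frac{1}{1 + t_1}\, \frac{d}{du}\bigg|_{u=0} \log(A + u\Delta) = \frac{\mathcal{T}_A(\Delta)}{1 + t_1}.$$

Therefore,

$$\mathcal{R}_A(A, \Delta) = -\frac{d}{dt_1}\bigg|_{t_1=0}\, \frac{\mathcal{T}_A(\Delta)}{1 + t_1} = \mathcal{T}_A(\Delta).$$

□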



An immediate consequence is: 
Lemma 4 For $A > 0$,

$$\mathcal{R}_A(A + \Delta) = \mathcal{R}_A(\Delta) + 2\mathcal{T}_A(\Delta) + I \ge 0.$$

Proof. Positivity of $\mathcal{R}_A(A + \Delta)$ follows from the integral representation (45).
Because of the bilinearity of $\mathcal{R}_A(\Delta_1, \Delta_2)$ and Lemmas 2 and 3 we also have

$$\mathcal{R}_A(A + \Delta) = \mathcal{R}_A(A + \Delta, A + \Delta)$$

$$= \mathcal{R}_A(A, A) + \mathcal{R}_A(A, \Delta) + \mathcal{R}_A(\Delta, A) + \mathcal{R}_A(\Delta, \Delta)$$

$$= I + 2\mathcal{T}_A(\Delta) + \mathcal{R}_A(\Delta).$$

□
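Lemma 5 For $A, B > 0$,

$$\mathrm{Tr}\,(A + B)\,\mathcal{R}_{A+B}(A) = \mathrm{Tr}\,A\,\mathcal{T}_{A+B}(A). \qquad (51)$$

Proof. By (48), (50) and Lemma 3,

$$\mathrm{Tr}\,(A + B)\,\mathcal{R}_{A+B}(A) = \mathrm{Tr}\,(A + B)\,\mathcal{R}_{A+B}(A, A)$$

$$= \mathrm{Tr}\,A\,\mathcal{R}_{A+B}(A + B, A)$$

$$= \mathrm{Tr}\,A\,\mathcal{T}_{A+B}(A).$$

□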

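Lemma 6 For $A, B > 0$,

$$\mathcal{R}_{A+B}(A) \le I. \qquad (52)$$

Proof. We use the integral representations of $\mathcal{T}$ and $\mathcal{R}$. Since $A + B + sI \ge B$,
we have $(A + B + sI)^{-1} \le B^{-1}$ and $B(A + B + sI)^{-1}B \le B$. Therefore,

$$\mathcal{R}_{A+B}(B) = 2\int_0^\infty ds\, (A + B + sI)^{-1}B(A + B + sI)^{-1}B(A + B + sI)^{-1}$$

$$\le 2\int_0^\infty ds\, (A + B + sI)^{-1}B(A + B + sI)^{-1}$$

$$= 2\mathcal{T}_{A+B}(B),$$

so that, applying Lemma 4 with $A + B$ in place of $A$ and $\Delta = -B$,

$$\mathcal{R}_{A+B}(A) = I + 2\mathcal{T}_{A+B}(-B) + \mathcal{R}_{A+B}(-B)$$

$$= I - 2\mathcal{T}_{A+B}(B) + \mathcal{R}_{A+B}(B)$$

$$\le I - 2\mathcal{T}_{A+B}(B) + 2\mathcal{T}_{A+B}(B) = I.$$

□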

To prove Proposition 1 we need one more lemma, of a more general nature.

Lemma 7 Let $f(t)$ be a real-valued convex function on $[0, 1]$. If, moreover,
$f(0) \le 0$ and $f(0) \le f'(0)$, then $\forall t \in [0, 1]$, $f(0) \le (1 - t)f(t)$.

Proof. For $t = 1$ the claim reduces to $f(0) \le 0$. Since $f(0) \le 0$, for all
$t \in [0, 1)$ we have $f(0)/(1 - t) \le f(0) \le f'(0)$.
Multiplying both sides by $t(1 - t)$ yields $t f(0) \le t(1 - t)f'(0)$. Adding $(1 - t)f(0)$
to both sides gives $f(0) \le t(1 - t)f'(0) + (1 - t)f(0) = (1 - t)(f(0) + t f'(0))$.
By convexity of $f$, $f(0) + t f'(0)$ is a lower bound on $f(t)$, and the inequality
of the lemma follows. □

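Proof of Proposition 1. Let us state again what we need to prove: for $A, B, C \ge 0$,
and with $a = \operatorname{Tr} A$, $c = \operatorname{Tr} C$,
\[
0 \le \operatorname{Tr} A\,\mathcal{T}_{A+B}(A) - \operatorname{Tr} A\,\mathcal{T}_{A+B+C}(A) \le \frac{ac}{a+c}. \tag{53}
\]

The first inequality in (53) easily follows from the fact that $x \mapsto 1/x$ is operator
monotone decreasing, together with the identity
\[
\operatorname{Tr} X\,\mathcal{T}_A(X) = \int_0^\infty ds\, \operatorname{Tr}\big(X^{1/2}(A+sI)^{-1}X^{1/2}\big)^2
\]
and monotonicity of the function $X \mapsto \operatorname{Tr} X^2$.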

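The second inequality involves more work. To that end, consider two positive
density operators $\rho$ and $\sigma$, an operator $G \ge \rho$, and the function
\begin{align*}
f(t) &= -1 + \frac{d}{ds}\bigg|_{s=0} \operatorname{Tr} \rho \log\big(G + s\rho + t(\sigma - G)\big) \\
&= \operatorname{Tr} \rho\,\mathcal{T}_{t\sigma + (1-t)G}(\rho) - 1.
\end{align*}
We will first show that $(1-t) f(t) \ge f(0)$ for $0 \le t \le 1$.

The derivative $f'(0)$ is given by
\begin{align*}
f'(0) &= \frac{\partial^2}{\partial s\,\partial t}\bigg|_{s=t=0} \operatorname{Tr} \rho \log\big(G + s\rho + t(\sigma - G)\big) \\
&= \operatorname{Tr} \rho\,\mathcal{R}_G(\rho, G - \sigma) \\
&= \operatorname{Tr} (G - \sigma)\,\mathcal{R}_G(\rho).
\end{align*}
By combining Lemma 5 and Lemma 6 we obtain
\[
\operatorname{Tr} (G - \sigma)\,\mathcal{R}_G(\rho) \ge \operatorname{Tr} \rho\,\mathcal{T}_G(\rho) - 1,
\]
which proves that $f'(0) \ge f(0)$. Indeed, Lemma 5 gives
$\operatorname{Tr} G\,\mathcal{R}_G(\rho) = \operatorname{Tr} \rho\,\mathcal{T}_G(\rho)$, while Lemma 6 gives
$\mathcal{R}_G(\rho) \le I$ and hence $\operatorname{Tr} \sigma\,\mathcal{R}_G(\rho) \le \operatorname{Tr} \sigma = 1$.

At $t = 0$, $f$ takes the value $\operatorname{Tr} \rho\,\mathcal{T}_G(\rho) - 1$, which is non-positive, since
$\mathcal{T}_G(\rho) \le \mathcal{T}_G(G) = I$. Thus $f(0) \le 0$.

By convexity of the map $G \mapsto \operatorname{Tr} \rho\,\mathcal{T}_G(\rho)$, $f(t)$ is convex.

By Lemma 7 these three statements imply that $(1-t) f(t) \ge f(0)$ for $0 \le t \le 1$,
i.e. the minimum of $(1-t) f(t)$ over $[0,1]$ occurs at $t = 0$.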

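Now let $a > 0$, $c \ge 0$, and $t = c/(a+c)$, which takes values in the interval
$[0,1]$, and is $0$ iff $c = 0$. Then $1 - t = a/(a+c)$. Also, let $G = \rho + B/a$, with
$B \ge 0$. With these substitutions, and using the homogeneity
$\mathcal{T}_{\lambda A}(\Delta) = \lambda^{-1}\mathcal{T}_A(\Delta)$ for $\lambda > 0$, we get
\[
a(1-t) f(t) = \operatorname{Tr} a\rho\,\mathcal{T}_{a\rho + B + c\sigma}(a\rho) - \frac{a^2}{a+c},
\]
and by the above this expression is minimal for $c = 0$. That is,
\[
\operatorname{Tr} a\rho\,\mathcal{T}_{a\rho + B + c\sigma}(a\rho) - \frac{a^2}{a+c} \ge \operatorname{Tr} a\rho\,\mathcal{T}_{a\rho + B}(a\rho) - a,
\]
or, after rearranging terms,
\[
\operatorname{Tr} a\rho\,\mathcal{T}_{a\rho + B}(a\rho) - \operatorname{Tr} a\rho\,\mathcal{T}_{a\rho + B + c\sigma}(a\rho) \le \frac{ac}{a+c}.
\]
By introducing $A = a\rho$ and $C = c\sigma$ we obtain the second inequality of
(53). □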
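Inequality (53) is easy to probe numerically. The following is a minimal sketch, not part of the original argument: it assumes the standard divided-difference (Daleckii-Krein) formula for the derivative of the matrix logarithm to evaluate $\mathcal{T}_A(X)$, and the helper names `frechet_log`, `random_psd` and `check_proposition` are hypothetical. It samples random positive matrices and checks both sides of (53), as well as $\mathcal{T}_A(A) = I$.

```python
import numpy as np

def frechet_log(A, X):
    """Return T_A(X), the derivative of log(A + tX) at t = 0, for A > 0."""
    w, V = np.linalg.eigh(A)
    Xt = V.conj().T @ X @ V                      # X expressed in the eigenbasis of A
    dw = w[:, None] - w[None, :]
    dl = np.log(w)[:, None] - np.log(w)[None, :]
    same = np.abs(dw) < 1e-12 * np.maximum(w[:, None], w[None, :])
    # First divided differences of log; for coinciding eigenvalues use log'(w) = 1/w.
    L = np.where(same, 1.0 / w[:, None], dl / np.where(same, 1.0, dw))
    return V @ (L * Xt) @ V.conj().T

def random_psd(n, rng):
    """Random positive definite matrix (assumption: complex Gaussian G, A = G G^*)."""
    G = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return G @ G.conj().T

def check_proposition(A, B, C):
    """Return (middle expression of (53), upper bound ac/(a+c))."""
    diff = (np.trace(A @ frechet_log(A + B, A))
            - np.trace(A @ frechet_log(A + B + C, A))).real
    a, c = np.trace(A).real, np.trace(C).real
    return diff, a * c / (a + c)

rng = np.random.default_rng(0)
results = [check_proposition(*(random_psd(4, rng) for _ in range(3)))
           for _ in range(200)]
all_ok = all(-1e-9 <= d <= b + 1e-9 for d, b in results)

A0 = random_psd(4, rng)
identity_error = np.linalg.norm(frechet_log(A0, A0) - np.eye(4))
print(all_ok, identity_error)
```

A random search of this kind can only fail to find counterexamples; it is no substitute for the proof above.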



6 Quantum Skew Divergence as a state distinguishability measure 



The quantum relative entropy (QRE) between two quantum states $\rho$ and
$\sigma$, $S(\rho\|\sigma) = \operatorname{Tr} \rho(\log\rho - \log\sigma)$, is a non-commutative generalisation of the
Kullback-Leibler divergence (KLD) $KL(p\|q)$ between probability distributions
$p$ and $q$, and is widely used as a measure of dissimilarity of quantum states

ESI. 



Both the KLD and the QRE exhibit a number of features that arise naturally 
from their underlying mathematical model and that may be useful in certain 
circumstances. However, these features also imply that neither the KLD nor 
the QRE is a proper distance measure in the mathematical sense. First of all, 
the KLD and QRE are asymmetric in their arguments. This alone already 
precludes their use as a distance measure, and prompted the terminology KL 
'divergence', rather than KL 'distance'. Secondly, neither obeys the triangle 
inequality. A third feature, and the one considered in this paper, is that the 
KLD is infinite whenever for some $i$, the probability $q(i)$ is zero when $p(i)$
is not. Likewise, $S(\rho\|\sigma)$ is infinite when the support of $\rho$ is not contained in
the support of $\sigma$. In particular, this renders the relative entropy useless as a
distance measure between pure states, since it is infinite for pure $\rho$ and
$\sigma$, unless $\rho$ and $\sigma$ are exactly equal (in which case it always gives 0). It is
therefore unable to tell by how much two distinct pure states are dissimilar.

It is illustrative to see how this feature comes about in one of the more impor- 
tant operational interpretations of the KLD and QRE, namely in the context 
of asymmetric hypothesis testing. Let the null hypothesis $H_0$ be that a random
variable $X$ is drawn from the distribution $p$; the alternative hypothesis
$H_1$, that it is drawn from distribution $q$. A test is to be designed that optimally
discriminates between the two. Two types of error are relevant: a type
I error (false positive) is when the test selects $H_1$ when in fact $H_0$ is true; a
type II error (false negative) is when the test selects $H_0$ when $H_1$ is true. The
probability of a type I error is usually denoted by $\alpha$, and the probability of a
type II error by $\beta$. These probabilities cannot usually both be made zero, but
they can be made to both tend to 0 exponentially fast when $N$, the number of
samples of $X$ looked at by the test, tends to infinity. One can then define the
corresponding error rates, $\alpha_R$ and $\beta_R$, as the limits
$\alpha_R = -\lim_{N\to\infty} (1/N) \log\alpha$ and $\beta_R = -\lim_{N\to\infty} (1/N) \log\beta$.
These rates quantify how fast $\alpha$ and $\beta$ tend to 0 with $N$.

The KLD can be given a clear operational meaning in this context, as the
best possible rate $\beta_R$ when $\alpha$ (not $\alpha_R$) is to be kept below a certain value
$\epsilon$ (a value which, surprisingly, does not ultimately enter in the value of the
optimal $\beta_R$). It is now not hard to see why the KLD should be infinite when,
for some $i$, $q(i)$ is zero but $p(i)$ is not. In this case an optimal test should only
look at outcome $i$. If this outcome occurs, even if only once, this immediately
rules out the alternative hypothesis. The number of samples required to find
outcome $i$ amongst them (which depends on $p(i)$) is finite, therefore the rate
$\beta_R$ is infinite. In other words, the infinity of the KLD represents the fact that
"the theory 'All crows are black' can be refuted by the single observation of a 
white crow". 

Whereas the emergence of this feature of the KLD (and the QRE) seems quite 
natural, it may not always be that desirable. Firstly, the unboundedness of 
the KLD may be a source of numerical instability in applications. Secondly, 
the extreme focus on zeros of $q$ (zero eigenvalues of $\sigma$, respectively) implies a
complete disregard of other discriminating information. As stated before, the
QRE can only tell distinctness of pure states, but not by how much. Thirdly,
in applications where $q$ is an empirical distribution, the weight put on events
with $q(i) = 0$ is totally inappropriate: in empirical distributions this corresponds
to unseen events, not to impossible ones. This is a serious concern in
applications such as natural language processing [16], where the events are
occurrences of word combinations in a large (but not infinitely large) corpus
of text, and in which many genuine but rare word combinations do not
occur at all.² Similar concerns can be raised in the quantum case, when $\sigma$
is a reconstructed quantum state obtained from quantum state tomography
experiments. When maximum likelihood reconstruction of nearly pure states
produces reconstructed states with one or more zero eigenvalues, these zeroes
should not be interpreted as zero probabilities. How to properly deal with
these empirical quantum states is known in the tomography literature as the
'zero-eigenvalue problem' [6]. A final problem is of a theoretical nature: because
KLD and QRE can become infinite, it is much harder (and less natural)
to obtain good upper bounds on these quantities in terms of other distance
measures. Invariably, some information about the smallest eigenvalues of $\rho$
and $\sigma$ has to be supplied to allow even the existence of such bounds (see, e.g.
[3]).



Several solutions have been put forward to overcome the problems associated 
with this infinity feature, in the classical case and in the quantum case, in the 
form of modifications of the KLD (QRE). In the classical case, one of the first 
to discuss several of these modifications in detail was Lin [22]. In addition to 
the infinity problem, he also considered the asymmetry issue. He introduced 
the following dissimilarity measures based on the KLD, which he called the 
K-divergence and L-divergence, respectively: 

\begin{align}
K(p\|q) &= S\big(p\,\|\,(p+q)/2\big) \tag{54} \\
L(p,q) &= K(p\|q) + K(q\|p) \tag{55} \\
&= 2H\big((p+q)/2\big) - H(p) - H(q). \tag{56}
\end{align}

Here, $H(p)$ is the Shannon entropy of a distribution, which for the discrete
case reads $H(p) = -\sum_i p(i) \log p(i)$. Lin also considered a generalisation of
the L-divergence with different weights, which he called the Jensen-Shannon
divergence:
\[
JS_\alpha(p,q) = H\big(\alpha p + (1-\alpha) q\big) - \alpha H(p) - (1-\alpha) H(q). \tag{57}
\]

Lin pointed out that the K-divergence is a special case of the Csiszár f-divergences
with the function $f$ given by $f(x) = x \log(2x/(1+x))$ [9].



² Consider, for example, the total number of occurrences of the word combination
"relative entropy" in the combined issues of the New York Times.

In [16], Lee introduced a generalisation of Lin's K-divergence that incorporates
different weights,
\[
s_\alpha(p\|q) = S\big(p\,\|\,\alpha q + (1-\alpha) p\big), \tag{58}
\]
a quantity which she called the $\alpha$-skew divergence. In contrast to Lin's, whose
motivations were mainly theoretical and driven by the lack of good upper
bounds on the KL divergence, Lee's proposal was fuelled by a practical application
in natural language processing: the estimation and subsequent use
of probabilities of unseen word combinations [16,17]. Here, the asymmetry of
the KLD had to be maintained but its inordinate focus on zero-probabilities
had to be alleviated. Lee proposed a 'smoothing' of the $q$ distribution with $p$
by mixing a small amount of $p$ into $q$ (she used $\alpha = 0.99$), in order to shift
the focus to events that are seen under both distributions.
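The effect of this smoothing can be seen on a toy example. The sketch below is illustrative only: the distributions `p` and `q` are made up, the function names are hypothetical, and natural logarithms are assumed. It evaluates the KLD and the skew divergence (58) for a pair where $q$ assigns zero probability to an outcome that $p$ does see.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p||q) in nats; infinite if q(i) = 0 < p(i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    seen = p > 0
    if np.any(q[seen] == 0):
        return np.inf
    return float(np.sum(p[seen] * np.log(p[seen] / q[seen])))

def skew_divergence(p, q, alpha=0.99):
    """Alpha-skew divergence as in (58): KL(p || alpha*q + (1 - alpha)*p)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return kl(p, alpha * q + (1 - alpha) * p)

p = [0.5, 0.5, 0.0]   # toy data: the second outcome is "unseen" under q
q = [1.0, 0.0, 0.0]
print(kl(p, q), skew_divergence(p, q))
```

Because the mixture $\alpha q + (1-\alpha)p$ dominates $(1-\alpha)p$ entrywise, the skew divergence can never exceed $-\log(1-\alpha)$, which is why the second value stays finite.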

In the quantum case, the first attempt to overcome the infinity problem of
the QRE was undertaken by Lendi, Farhadmotamed and van Wonderen [18],
who proposed to mix both $\rho$ and $\sigma$ with the maximally mixed state. They
introduced the regularised relative entropy as
\[
R(\rho\|\sigma) = c_d\, S\!\left(\frac{\rho + I}{1+d} \,\Big\|\, \frac{\sigma + I}{1+d}\right),
\]
where $d$ is the dimension of state space, and $c_d$ is a normalisation constant. It is
clear that this procedure only works for finite-dimensional states. One might 
also consider mixing both states with a smaller amount of the maximally 
mixed state, for example as a quantum generalisation of Laplace's rule of 
succession for empirical distributions, by which 1 is added to the frequencies 
of all outcomes, in order to properly account for unseen events. 

Another possibility, also applicable to the infinite dimensional case, is to apply 
a smoothing process. One can define the smooth relative entropy between 
states $\rho$ and $\sigma$ as the infimum of the ordinary relative entropy between $\rho$ and
another (unnormalised) state $\tau$, where $\tau$ is constrained to be $\epsilon$-close to $\sigma$ in
trace norm distance:
\[
S_\epsilon(\rho\|\sigma) = \inf\{ S(\rho\|\tau) : \tau \ge 0,\ \operatorname{Tr}\tau \le 1,\ \|\tau - \sigma\|_1 \le \epsilon \}. \tag{59}
\]

This form of smoothing has already been applied to Rényi entropies and min-
and max-relative entropy [10,28,33], giving rise to a quantity with an operational
interpretation. However, the process can equally well be applied to
ordinary relative entropy. 

By far the most popular modification of the QRE in the quantum case is the
quantum Jensen-Shannon divergence (QJSD) [8,12,13,23,29], which has the
additional feature of being symmetric in its arguments. It comes in several
flavours: for pairs of states and equal weights, we have the 'vanilla' style:
\begin{align}
QJS(\rho,\sigma) &= \tfrac12\Big[ S\big(\rho\,\|\,\tfrac12\rho + \tfrac12\sigma\big) + S\big(\sigma\,\|\,\tfrac12\rho + \tfrac12\sigma\big) \Big] \tag{60} \\
&= S\big((\rho+\sigma)/2\big) - \big(S(\rho) + S(\sigma)\big)/2. \tag{61}
\end{align}
Here $S(\rho)$ is the von Neumann entropy $S(\rho) = -\operatorname{Tr}\rho\log\rho$. The latter formula
allows for a straightforward generalisation to general statistical weights, and
to more than two states:
\[
QJS_{(\pi_1,\ldots,\pi_n)}(\rho_1,\ldots,\rho_n) = S\Big(\sum_{i=1}^n \pi_i\rho_i\Big) - \sum_{i=1}^n \pi_i S(\rho_i). \tag{62}
\]

In the context of quantum channels, this quantity is also known as the Holevo
$\chi$ of an ensemble $\{(\rho_i,\pi_i)\}_{i=1}^n$.

It seems that in the quantum case, Lee's a-skew divergence has not been stud- 
ied before. It was highly rewarding to discover the many interesting properties 
of the SD, not to mention the applications that will be presented in the next 
sections. 

The SD is closely related to other distinguishability measures. Firstly, it can 
be seen as a simplified version of smoothed relative entropy: to calculate the 
latter a minimisation problem over states $\tau$ has to be solved. However, there
is a simple canonical choice for $\tau$ that achieves the same purpose of regularisation
but without having to find the exact minimiser. Namely, we can
take that $\tau$ that lies on the m-geodesic (mixing geodesic)³ from $\rho$ to $\sigma$; i.e.
$\tau = \alpha\rho + (1-\alpha)\sigma$. In so doing we obtain exactly the SD with $\alpha = \epsilon/\|\rho - \sigma\|_1$.
For that reason, the SD can be a useful approximation for the smoothed rel- 
ative entropy. Further study will be devoted to the question how good this 
approximation may be. 
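To make the contrast with the QRE concrete, here is a small numerical sketch. It assumes the normalisation $SD_\alpha(\rho\|\sigma) = S(\rho\|\alpha\rho + (1-\alpha)\sigma)/(-\log\alpha)$, which is not restated in this section and is taken here as an assumption; the helper names `rel_entropy`, `skew_divergence` and `pure` are hypothetical. For distinct pure states the relative entropy is infinite, whereas the SD is finite and distinguishes orthogonal from merely non-parallel states.

```python
import numpy as np

def rel_entropy(rho, tau, tol=1e-12):
    """S(rho||tau) = Tr rho (log rho - log tau) in nats; +inf if supp(rho) is not in supp(tau)."""
    r, U = np.linalg.eigh(rho)
    s, V = np.linalg.eigh(tau)
    W = np.abs(U.conj().T @ V) ** 2            # W[i, j] = |<u_i|v_j>|^2
    rp, sp = r > tol, s > tol
    if np.any(W[np.ix_(rp, ~sp)] > tol):
        return np.inf
    ent = np.sum(r[rp] * np.log(r[rp]))
    cross = np.sum(r[rp][:, None] * W[np.ix_(rp, sp)] * np.log(s[sp])[None, :])
    return float(ent - cross)

def skew_divergence(rho, sigma, alpha):
    """Assumed normalisation: S(rho || alpha*rho + (1-alpha)*sigma) / (-log alpha)."""
    return rel_entropy(rho, alpha * rho + (1 - alpha) * sigma) / (-np.log(alpha))

def pure(v):
    """Projector onto the normalised vector v."""
    v = np.asarray(v, dtype=complex)
    v = v / np.linalg.norm(v)
    return np.outer(v, v.conj())

zero, one, plus = pure([1, 0]), pure([0, 1]), pure([1, 1])
qre_pure = rel_entropy(zero, plus)                 # distinct pure states
sd_orthogonal = skew_divergence(zero, one, 0.3)    # orthogonal pure states
sd_overlapping = skew_divergence(zero, plus, 0.3)  # non-orthogonal pure states
print(qre_pure, sd_orthogonal, sd_overlapping)
```

Under the assumed normalisation, the operator inequality $\alpha\rho + (1-\alpha)\sigma \ge \alpha\rho$ caps the SD at 1, and orthogonal states attain that cap.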

The SD is also the non-symmetric distinguishability measure underpinning 
the quantum Jensen-Shannon divergence. It is therefore not surprising that 
mathematical results for the SD lead to useful mathematical results for the 
QJSD and the Holevo $\chi$. This is the topic of the next section.



7 Inequalities for the Quantum Jensen-Shannon Divergence and 
Holevo Information 



Recall that the Quantum Jensen-Shannon Divergence (QJS) of $n$ states $\rho_i$,
with weights $p_i$, is equal to the Holevo information, also called Holevo $\chi$, of
the quantum ensemble $\mathcal{E} = \{(p_i,\rho_i)\}_{i=1}^n$, and is defined as
\[
QJS_{(p_1,\ldots,p_n)}(\rho_1,\ldots,\rho_n) = \chi(\mathcal{E}) = S\Big(\sum_i p_i\rho_i\Big) - \sum_i p_i S(\rho_i). \tag{63}
\]



³ It is not a good idea to choose an e-geodesic (exponential geodesic) here as this
once again leads to infinities.

We will denote by $p$ the probability vector $p = (p_1,\ldots,p_n)$. Let the averaged
state of the ensemble be denoted by $\rho_0 := \sum_i p_i\rho_i$. It will also be useful to
define the complementary states
\[
\bar\rho_i := \frac{\rho_0 - p_i\rho_i}{1-p_i} = \frac{\sum_{j,\,j\ne i} p_j\rho_j}{1-p_i}.
\]
The Holevo $\chi$ can be rewritten in terms of quantum skew divergences as
follows:
\begin{align}
\chi(\mathcal{E}) &= \sum_i p_i\, S(\rho_i\|\rho_0) \nonumber \\
&= -\sum_i p_i \log(p_i)\, SD_{p_i}(\rho_i\|\bar\rho_i). \tag{64}
\end{align}
From this representation and the bounds on SD follow several bounds for $\chi$
that improve on existing bounds in the literature.

Let $t_{ij} = T(\rho_i,\rho_j) = \|\rho_i - \rho_j\|_1/2$, the trace distance between signal states $\rho_i$
and $\rho_j$. Also, let $t = \max_{i,j} t_{ij}$. From the bound of Theorem 9,
$SD_\alpha(\rho\|\sigma) \le T(\rho,\sigma)$, and the convexity of $T$ in each of its arguments, we
immediately obtain
\begin{align}
\chi(\mathcal{E}) &= -\sum_i p_i\log(p_i)\, SD_{p_i}(\rho_i\|\bar\rho_i) \nonumber \\
&\le -\sum_i p_i\log(p_i)\, T(\rho_i,\bar\rho_i) \nonumber \\
&\le -\sum_i p_i\log(p_i) \sum_{j\ne i} \frac{p_j}{1-p_i}\, t_{ij} \tag{65} \\
&\le H(p)\, t. \tag{66}
\end{align}
In the last line, $H(p) = -\sum_i p_i\log(p_i)$ is the Shannon entropy of the ensemble's
probability vector. Our bound combines the well-known bound (see, e.g.
[27], Th. 3.7) $\chi(\mathcal{E}) \le H(p)$ with the bound $\chi(\mathcal{E}) \le \log(n)\, t$ of Theorem 14 in
[8] (only proven for $n = 2$), and therefore improves on both.
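The representation (64) and the bound (66) can be checked numerically on a random ensemble. This is a sketch under stated assumptions: the SD is taken to be $SD_\alpha(\rho\|\sigma) = S(\rho\|\alpha\rho + (1-\alpha)\sigma)/(-\log\alpha)$, natural logarithms are used, the states are random full-rank density matrices, and all helper names are hypothetical.

```python
import numpy as np

def logm(A):
    """Matrix logarithm of a positive definite Hermitian matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.conj().T

def entropy(rho):
    """Von Neumann entropy in nats."""
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-15]
    return float(-np.sum(w * np.log(w)))

def rel_entropy(rho, tau):
    """S(rho||tau) for full-rank states."""
    return float(np.trace(rho @ (logm(rho) - logm(tau))).real)

def trace_distance(rho, sigma):
    return 0.5 * float(np.sum(np.abs(np.linalg.eigvalsh(rho - sigma))))

def random_state(d, rng):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(1)
n, d = 3, 3
p = rng.dirichlet(np.ones(n))
states = [random_state(d, rng) for _ in range(n)]
rho0 = sum(pi * r for pi, r in zip(p, states))

# Holevo chi from the definition (63).
chi_def = entropy(rho0) - sum(pi * entropy(r) for pi, r in zip(p, states))

# Holevo chi from the skew-divergence representation (64).
chi_sd = 0.0
for pi, r in zip(p, states):
    comp = (rho0 - pi * r) / (1 - pi)                       # complementary state
    sd = rel_entropy(r, pi * r + (1 - pi) * comp) / (-np.log(pi))
    chi_sd += -pi * np.log(pi) * sd

t = max(trace_distance(a, b) for a in states for b in states)
H = float(-np.sum(p * np.log(p)))
print(chi_def, chi_sd, H * t)
```

The two evaluations of $\chi$ agree because $p_i\rho_i + (1-p_i)\bar\rho_i = \rho_0$, so each SD term reduces to $S(\rho_i\|\rho_0)/(-\log p_i)$.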

For binary ensembles, Roga [29] proves the following bound on $\chi(\mathcal{E})$ in terms of
the Uhlmann fidelity $F$ between the two signal states (see also [12] for
extensions to more than 2 states):
\[
\chi(\mathcal{E}) \le S(\sigma), \qquad \sigma = \begin{pmatrix} p & \sqrt{p(1-p)}\,F \\ \sqrt{p(1-p)}\,F & 1-p \end{pmatrix}, \tag{67}
\]
where $F = F(\rho_1,\rho_2) = \operatorname{Tr}\sqrt{\sqrt{\rho_1}\,\rho_2\sqrt{\rho_1}}$. A numerical investigation showed that
this gives a bound that is sometimes lower in value than (66), which is in terms
of the trace distance, and sometimes higher. However, when replacing $t$ by its
upper bound $\sqrt{1-F^2}$ in (66), Roga's bound (67) is always better. Which
bound to choose of course also depends on ease of use and generality.



Now consider two ensembles $\mathcal{E}$ and $\mathcal{E}'$ with the same probabilities $p_i$, but
different signal states $\rho_i$ and $\rho_i'$, respectively. Let $t_i = \|\rho_i - \rho_i'\|_1/2$ be the
trace distance between corresponding signal states. A naive way to obtain a
bound on $|\chi(\mathcal{E}) - \chi(\mathcal{E}')|$ in terms of the $t_i$ would be to use Fannes' continuity
bound on the von Neumann entropy [11]. However, this would lead to a bound
that is dimension dependent. Here I show how the two continuity inequalities
on SD can be used to obtain a dimension-independent bound. Define $\rho_0'$, $\bar\rho_i'$
analogously, $t_0 = T(\rho_0,\rho_0')$ and $\bar t_i = T(\bar\rho_i,\bar\rho_i')$. The distances $\bar t_i$ can be bounded
in terms of the $t_j$ as
\[
\bar t_i \le \frac{\sum_{j\ne i} p_j t_j}{1-p_i} \le \max_{j\ne i} t_j. \tag{68}
\]

To simplify the formulas, we will express everything in terms of the largest $t_j$,
which we denote by $t$.

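First consider the difference between terms
\begin{align*}
& SD_{p_i}(\rho_i\|\bar\rho_i) - SD_{p_i}(\rho_i'\|\bar\rho_i') \\
&\quad = SD_{p_i}(\rho_i\|\bar\rho_i) - SD_{p_i}(\rho_i'\|\bar\rho_i) + SD_{p_i}(\rho_i'\|\bar\rho_i) - SD_{p_i}(\rho_i'\|\bar\rho_i') \\
&\quad \le SD_{p_i}(0|1) - SD_{p_i}(t_i|1) + SD_{p_i}(t_i|0) + SD_{p_i}(1|0) - SD_{p_i}(1|\bar t_i) + SD_{p_i}(0|\bar t_i) \\
&\quad \le SD_{p_i}(0|1) - SD_{p_i}(t|1) + SD_{p_i}(t|0) + SD_{p_i}(1|0) - SD_{p_i}(1|t) + SD_{p_i}(0|t) \\
&\quad = \frac{1}{-\log(p_i)} \left( t \log\frac{p_i t + 1 - p_i}{p_i t} + \log\frac{p_i + (1-p_i)\,t}{p_i} \right).
\end{align*}
Summing over all terms then yields
\[
|\chi(\mathcal{E}) - \chi(\mathcal{E}')| \le \sum_i p_i\, t \log\Big(1 + \frac{1-p_i}{p_i t}\Big) + \sum_i p_i \log\Big(1 + \frac{(1-p_i)\,t}{p_i}\Big). \tag{69}
\]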

The probabilities $p_i$ can be 'eliminated' by exploiting concavity of the logarithm,
giving
\[
|\chi(\mathcal{E}) - \chi(\mathcal{E}')| \le t\log\big(1 + (n-1)/t\big) + \log\big(1 + (n-1)\,t\big). \tag{70}
\]
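The passage from (69) to (70) is an application of Jensen's inequality, $\sum_i p_i \log x_i \le \log \sum_i p_i x_i$, to each of the two sums. The sketch below checks this numerically for random probability vectors; the function names `rhs_69` and `rhs_70` are hypothetical and natural logarithms are assumed.

```python
import numpy as np

def rhs_69(p, t):
    """Right-hand side of (69) for probability vector p and distance t."""
    p = np.asarray(p, float)
    return float(np.sum(p * t * np.log1p((1 - p) / (p * t)))
                 + np.sum(p * np.log1p((1 - p) * t / p)))

def rhs_70(n, t):
    """Right-hand side of (70) for n signal states and distance t."""
    return float(t * np.log1p((n - 1) / t) + np.log1p((n - 1) * t))

rng = np.random.default_rng(2)
gaps = []
for _ in range(1000):
    n = int(rng.integers(2, 8))
    p = rng.dirichlet(np.ones(n))
    t = float(rng.uniform(1e-3, 1.0))
    gaps.append(rhs_70(n, t) - rhs_69(p, t))

# For the uniform distribution the two bounds coincide.
uniform_gap = rhs_70(4, 0.2) - rhs_69(np.full(4, 0.25), 0.2)
print(min(gaps), uniform_gap)
```

The gap is never negative, and it vanishes for uniform weights, so (70) loses nothing in that case.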
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8 The Small Incremental Mixing Conjecture 

Consider an ensemble of time-dependent states, $\mathcal{E}(t) = \{(p_j,\rho_j(t))\}_{j=1}^n$, where
each state $\rho_j(t)$ evolves under the influence of a Hamiltonian $H_j$; that is,
$\rho_j(t) = U_j(t)\rho_j U_j(t)^*$, where $U_j(t) = \exp(itH_j)$. Let $\rho_0(t)$ be the ensemble
averaged state, $\rho_0(t) = \sum_{j=1}^n p_j\rho_j(t)$. We will drop the time argument to indicate
the state at time 0, $\rho_j := \rho_j(0)$.

The mixing rate $\Lambda(\mathcal{E})$ of this ensemble is defined as
\[
\Lambda(\mathcal{E}) := \frac{d}{dt}\bigg|_{t=0} S(\rho_0(t)).
\]

Bravyi conjectured in [7] the following upper bound on the mixing rate for
binary ensembles ($n = 2$):
\[
\Lambda(\mathcal{E}) \le c\, h_2(p)\, \|H_1 - H_2\|,
\]
where $c$ is a dimension- and state-independent constant, and $h_2(p)$ is the Shannon
entropy of the distribution $(p, 1-p)$. He called this the Small Incremental
Mixing (SIM) conjecture. Lieb and Vershynina considered this conjecture in
[21] and inquired whether this bound could also be valid for larger ensembles
($n > 2$); that is, whether
\[
\Lambda(\mathcal{E}) \le c\, H(p),
\]
where $H(p)$ is the Shannon entropy of the ensemble's probability vector, and
all the Hamiltonians satisfy $\|H_j\| \le 1$.

Bravyi's SIM conjecture was proven very recently by Van Acoleyen et al. [31],
with a value for the constant $c = 9$. More details about the physical relevance
of this conjecture (now a theorem), in particular to entanglement generating
rates and entanglement area laws, can be found in [7,21,31].

In this Section we provide an entirely different proof, and obtain a sharper
form of the inequality, with constant $c = 2$. Our approach is based on the
observation that the mixing rate is equal to the rate of change of the ensemble's
Holevo information:
\[
\Lambda(\mathcal{E}) = \frac{d}{dt}\bigg|_{t=0} \chi(\mathcal{E}(t)). \tag{71}
\]
This follows immediately from the definition (63) of $\chi$ and the fact that the
entropy of the signal states $\rho_j(t)$ does not change under unitary evolution. In
fact, we have more generally
\[
S(\rho_0(t)) - S(\rho_0) = \chi(\mathcal{E}(t)) - \chi(\mathcal{E}(0)). \tag{72}
\]
Due to this connection between $\Lambda$ and $\chi$ it is natural to attempt the same
approach as the one that led to continuity inequality (69). By (64), we have,
at any point in time $t$,
\begin{align*}
\chi(\mathcal{E}(t)) - \chi(\mathcal{E}(0))
&= -\sum_j p_j\log(p_j)\big( SD_{p_j}(\rho_j(t)\|\bar\rho_j(t)) - SD_{p_j}(\rho_j\|\bar\rho_j) \big) \\
&= -\sum_j p_j\log(p_j)\big( SD_{p_j}(\rho_j\|U_j^*(t)\bar\rho_j(t)U_j(t)) - SD_{p_j}(\rho_j\|\bar\rho_j) \big).
\end{align*}
In the last line we have exploited unitary invariance of the skew divergence.
A natural first attempt is to try Theorem 15:
\begin{align*}
\chi(\mathcal{E}(t)) - \chi(\mathcal{E}(0)) &\le -\sum_j p_j\log(p_j)\big(1 - SD_{p_j}(1|t_j)\big) \\
&= \sum_j p_j \log\Big(1 + \frac{1-p_j}{p_j}\, t_j\Big),
\end{align*}
where $t_j = T(U_j^*(t)\bar\rho_j(t)U_j(t), \bar\rho_j)$. This requires estimating the trace norm
distances $t_j$ but it can already be seen that we will obtain a bound that is too
weak, due to the occurrence of the factor $(1-p_j)/(-p_j\log(p_j))$, which can
become arbitrarily large for small $p_j$.

The following theorem is a substantial sharpening of Theorem 15 for the special
case that $\sigma_1$ and $\sigma_2$ are unitarily equivalent.

Theorem 16 For states $\rho$ and $\sigma$, for $0 < \alpha < 1$, and $U = \exp(iH)$,
\[
SD_\alpha(\rho\|U\sigma U^*) - SD_\alpha(\rho\|\sigma) \le 2\|H\|. \tag{73}
\]
This is the key result leading to our proof of the SIM conjecture.

The proof of this theorem relies on the following simple estimate of the trace 
norm distance between two unitarily equivalent states. 

Lemma 8 For a state $\rho$ subject to a unitary evolution $U(t) = \exp(itH)$,
\[
T(U(t)\rho U^*(t), \rho) \le t\|H\|. \tag{74}
\]

Proof. Let $\rho' = U(t)\rho U^*(t)$. For infinitesimal $dt$, $U = I + i\,dt\,H$ and
$U\rho U^* = \rho + i\,dt\,[H,\rho]$. Thus $\|\rho' - \rho\|_1 = dt\,\|[H,\rho]\|_1 \le dt\, 2\|H\|\,\|\rho\|_1$, where we used
the triangle inequality for the trace norm, and Hölder's inequality. Since
$\|\rho\|_1 = 1$ and $T$ is half the trace norm distance, the increment of $T$ is at most
$dt\,\|H\|$. Integrating over $t$ and using the triangle inequality once more shows
that this is also true for finite $t$. □
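Lemma 8 lends itself to a quick numerical sanity check. The following sketch is illustrative only: it draws random states and random Hermitian generators (an arbitrary choice of test data), builds $U(t) = \exp(itH)$ from the eigendecomposition of $H$, and compares the trace distance with $t\|H\|$, where $\|H\|$ is the operator norm. All helper names are hypothetical.

```python
import numpy as np

def evolve(rho, H, t):
    """Return U rho U^* with U = exp(i t H), via the eigendecomposition of H."""
    w, V = np.linalg.eigh(H)
    U = (V * np.exp(1j * t * w)) @ V.conj().T
    return U @ rho @ U.conj().T

def trace_distance(rho, sigma):
    return 0.5 * float(np.sum(np.abs(np.linalg.eigvalsh(rho - sigma))))

def random_hermitian(d, rng):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (G + G.conj().T) / 2

def random_state(d, rng):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(3)
slack = []
for _ in range(500):
    rho, H = random_state(4, rng), random_hermitian(4, rng)
    t = float(rng.uniform(0.0, 0.5))
    op_norm = float(np.max(np.abs(np.linalg.eigvalsh(H))))
    # slack = (right-hand side of (74)) - (left-hand side of (74))
    slack.append(t * op_norm - trace_distance(evolve(rho, H, t), rho))
print(min(slack))
```

A non-negative minimum slack is consistent with (74); for large $t\|H\|$ the bound is of course trivial, since the trace distance never exceeds 1.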

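Proof of Theorem 16. Rather than working with the skew divergence, we consider
the differential skew divergence because of its symmetry property. For
all quantum states $\rho$, $\sigma_1$ and $\sigma_2$, and $0 < \alpha < 1$, with $t = T(\sigma_1,\sigma_2)$, the
continuity bound for the DSD in its second argument implies
\begin{align*}
DSD_\alpha(\rho\|\sigma_1) - DSD_\alpha(\rho\|\sigma_2) &\le DSD_\alpha(1|0) - DSD_\alpha(1|t) + DSD_\alpha(0|t) \\
&= \frac{t}{\alpha + (1-\alpha)\,t} \\
&\le \frac{t}{\alpha}.
\end{align*}
In particular, for $\sigma_2 = \sigma$ and $\sigma_1 = U\sigma U^*$, with $U = \exp(iH)$,
\[
DSD_\alpha(\rho\|U\sigma U^*) - DSD_\alpha(\rho\|\sigma) \le \frac{1}{\alpha}\, T(U\sigma U^*,\sigma) \le \frac{1}{\alpha}\,\|H\|,
\]
where we also have used Lemma 8.

Using the symmetry property of the DSD, $DSD_\alpha(\rho\|\sigma) = DSD_{1-\alpha}(\sigma\|\rho)$, we can
show that the inequality also holds when replacing $\alpha$ in the right-hand side
by $1-\alpha$. Indeed,
\begin{align*}
DSD_\alpha(\rho\|U\sigma U^*) - DSD_\alpha(\rho\|\sigma) &= DSD_{1-\alpha}(U\sigma U^*\|\rho) - DSD_{1-\alpha}(\sigma\|\rho) \\
&= DSD_{1-\alpha}(\sigma\|U^*\rho U) - DSD_{1-\alpha}(\sigma\|\rho) \\
&\le \frac{1}{1-\alpha}\, T(U^*\rho U,\rho) \le \frac{1}{1-\alpha}\,\|H\|.
\end{align*}
Hence, combining the two inequalities yields
\[
DSD_\alpha(\rho\|U\sigma U^*) - DSD_\alpha(\rho\|\sigma) \le \min\Big(\frac{1}{\alpha}, \frac{1}{1-\alpha}\Big)\,\|H\| \le 2\|H\|.
\]
Using the averaging procedure then yields the inequality of the theorem. □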

Theorem 17 (Small Incremental Mixing) Within the setup from above,
for $n = 2$,
\[
S(\rho_0(t)) - S(\rho_0) \le 2t\, h(p_1,p_2)\, \|H_1 - H_2\|, \tag{75}
\]
where $h(p_1,p_2) = H(p)$ is the Shannon entropy of the distribution $(p_1,p_2)$.

Proof. As already shown above,
\begin{align*}
S(\rho_0(t)) - S(\rho_0) &= \chi(\mathcal{E}(t)) - \chi(\mathcal{E}(0)) \\
&= -\sum_j p_j\log(p_j)\big( SD_{p_j}(\rho_j(t)\|\bar\rho_j(t)) - SD_{p_j}(\rho_j\|\bar\rho_j) \big) \\
&= -\sum_j p_j\log(p_j)\big( SD_{p_j}(\rho_j\|U_j^*(t)\bar\rho_j(t)U_j(t)) - SD_{p_j}(\rho_j\|\bar\rho_j) \big).
\end{align*}
For $n = 2$, the complementary state of $\rho_1$ is $\rho_2$ and vice versa; therefore, the
complementary states evolve unitarily as well:
\begin{align*}
S(\rho_0(t)) - S(\rho_0) = {} & -p_1\log(p_1)\big( SD_{p_1}(\rho_1\|U_1^*(t)U_2(t)\rho_2U_2^*(t)U_1(t)) - SD_{p_1}(\rho_1\|\rho_2) \big) \\
& -p_2\log(p_2)\big( SD_{p_2}(\rho_2\|U_2^*(t)U_1(t)\rho_1U_1^*(t)U_2(t)) - SD_{p_2}(\rho_2\|\rho_1) \big).
\end{align*}
Thus, in each line we can apply Theorem 16 to estimate the differences between
the skew divergences:
\begin{align*}
S(\rho_0(t)) - S(\rho_0) &\le -\sum_{j=1}^{2} p_j\log(p_j)\, 2t\,\|H_1 - H_2\| \\
&= 2t\,H(p)\,\|H_1 - H_2\|.
\end{align*}
□

One may ponder the question if and how our proof could be extended to cover
the case $n > 2$. At the very least an alternative to Theorem 16 would have to
be found. For $n > 2$, the spectra of the complementary states are no longer
constant in time, so Theorem 16 no longer applies and only the more general
bound of Theorem 15 is at our disposal. As the latter bound is too weak, the
proof of Theorem 17 does not immediately generalise to $n > 2$.

Moreover, already the very first step of our approach, writing $\chi(\mathcal{E})$ as the
weighted sum $\sum_j p_j S(\rho_j\|\rho_0)$ and treating each term separately, turns out to
be too crude. Consider the simple 2-dimensional, 3-state example
\begin{align*}
\rho_1 &= |0\rangle\langle 0|, & p_1 &= p, & H_1 &= 0, \\
\rho_2 &= |1\rangle\langle 1|, & p_2 &= (1-p)/2, & H_2 &= \sigma_x, \\
\rho_3 &= |1\rangle\langle 1|, & p_3 &= (1-p)/2, & H_3 &= -\sigma_x,
\end{align*}
where $\sigma_x$ is the X-Pauli matrix. A simple calculation shows that
$S(\rho_1\|\rho_0(t)) = -\log(p + (1-p)\sin^2(t))$, which has a maximal rate of change of $(1-p)/\sqrt{p}$;
for small $p$ this can be arbitrarily larger than $-\log p$. Nevertheless, this ensemble
does not break the SIM conjecture, as further calculations reveal. The
deleterious effects of the first term are compensated by the remaining terms,
which have the opposite sign and carry much larger weight.
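The closed form quoted for this example can be reproduced numerically. The sketch below is a minimal check, not part of the original text: it takes $H_1 = 0$ as in the example, picks $p = 0.01$ as an arbitrary illustrative value, builds the averaged state explicitly, and compares $S(\rho_1\|\rho_0(t))$ with $-\log(p + (1-p)\sin^2 t)$. The rate of change is obtained by differentiating the closed form, and the helper names are hypothetical.

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)     # Pauli X
proj0 = np.array([[1, 0], [0, 0]], dtype=complex)  # |0><0|
proj1 = np.array([[0, 0], [0, 1]], dtype=complex)  # |1><1|

def evolve(rho, H, t):
    """Return U rho U^* with U = exp(i t H)."""
    w, V = np.linalg.eigh(H)
    U = (V * np.exp(1j * t * w)) @ V.conj().T
    return U @ rho @ U.conj().T

def average_state(p, t):
    """Ensemble average rho_0(t) for the three-state example (H_1 = 0)."""
    return (p * evolve(proj0, 0 * sx, t)
            + 0.5 * (1 - p) * evolve(proj1, sx, t)
            + 0.5 * (1 - p) * evolve(proj1, -sx, t))

def rel_entropy_from_zero(tau):
    """S(|0><0| || tau) = -<0| log(tau) |0> for full-rank tau."""
    w, V = np.linalg.eigh(tau)
    return float(-np.sum(np.abs(V[0, :]) ** 2 * np.log(w)))

p = 0.01
ts = np.linspace(0.05, 1.5, 30)
numeric = np.array([rel_entropy_from_zero(average_state(p, t)) for t in ts])
closed = -np.log(p + (1 - p) * np.sin(ts) ** 2)
max_error = float(np.max(np.abs(numeric - closed)))

# |d/dt| of the closed form, maximised over a fine grid on [0, pi/2].
fine = np.linspace(0.0, np.pi / 2, 100001)
rate = (1 - p) * np.sin(2 * fine) / (p + (1 - p) * np.sin(fine) ** 2)
max_rate = float(np.max(rate))
print(max_error, max_rate, (1 - p) / np.sqrt(p), -np.log(p))
```

For this value of $p$ the maximal rate of the single term exceeds $-\log p$ by roughly a factor of two, which illustrates why the term-by-term estimate is too crude.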
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